Sunday, January 1, 2017

Using KDTrees in Apache Spark

OR: The best Coffee Shop in Hong Kong to catch Pokemon

I work with spatial data all the time, and one of the most common things I do with spatial data is find the nearest locations between two sets of objects. For example, in the context of Pokemon Go, you might ask, "what is the nearest Pokestop to a given Pokemon?" The standard way to do this is to use a data structure called a KDTree.

In the past six months, I have started using Apache Spark, and have quickly grown to love it. However, I haven't found any good tutorials on how to use KDTrees in Spark. To fill the void, I have written a short tutorial on how to use scipy KDTrees in Spark. The tutorial covers how to load Pokemon location information from Hong Kong, why KDTrees are great, how to create a KDTree of coffee shops in Hong Kong, and the code to combine them using Spark. I wrote the tutorial as a Jupyter notebook, but haven't figured out how to embed those in Blogger, so head over to GitHub for a gander.

If you want a sneak preview, this is how I define the udf which does the query:
coffee_udf = F.udf( partial(query_kdtree, cur_tree = coffee_tree_broadcast),
                    T.ArrayType( T.IntegerType() ) )
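
If you have not met a KDTree before, here is a minimal pure-Python sketch of the idea: split the points on alternating axes, then skip whole branches during a nearest-neighbour search. This is not the tutorial's code (which uses scipy's KDTree); the function names and coordinates are made up for illustration, and plain Euclidean distance on latitude/longitude is only a rough approximation.

```python
from math import hypot

def build_kdtree(points, depth=0):
    """Build a 2-D KD-tree from a list of (index, (x, y)) pairs.

    Each node is a tuple (index, point, left_subtree, right_subtree).
    """
    if not points:
        return None
    axis = depth % 2  # alternate between the two coordinates at each level
    points = sorted(points, key=lambda item: item[1][axis])
    mid = len(points) // 2
    index, point = points[mid]
    return (index, point,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Return (distance, index) of the stored point closest to target."""
    if node is None:
        return best
    index, point, left, right = node
    d = hypot(point[0] - target[0], point[1] - target[1])
    if best is None or d < best[0]:
        best = (d, index)
    axis = depth % 2
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Only search the far side if the splitting line is closer than the best match so far
    if abs(diff) < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best

# Hypothetical coffee shop coordinates (latitude, longitude)
coffee_shops = [(22.280, 114.160), (22.300, 114.170),
                (22.320, 114.210), (22.250, 114.190)]
coffee_tree = build_kdtree(list(enumerate(coffee_shops)))

pokemon = (22.301, 114.172)
distance, shop_index = nearest(coffee_tree, pokemon)
```

The pruning step is what makes the tree worthwhile: for most queries, only a small part of the tree is visited instead of every coffee shop.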

Thursday, March 31, 2016

A simple GUI for analyzing BioDAQ data

The Palmiter lab often monitors food and water intake in response to a variety of stimuli. To quantify these measurements, we house mice in BioDAQ chambers, which will record how much a mouse eats or drinks down to 0.01g. While the chambers are nice, the software that comes with them is terrible. In addition to being slow, it outputs data for each cage separately, which means you get to enjoy combining Excel files. After fighting the software, I decided I could do better, and made a small GUI.

Screenshot of the BioDAQ software. Note the Windows-95 era aesthetic. You can record up to 32 scales at a time. To get a single scale, you have to unclick the other 31! Data for each cage is saved individually.


The goal of the GUI was to be able to analyze many cages over multiple days, and output a single file containing the data for further analysis.

The GUI starts by loading a .tab file which contains the feeding data. The code for this was actually easy, as the data is just a tab-delimited text file. (It takes a little longer than I expected, a few seconds, due to datetime parsing.)

Once the data is loaded, the GUI asks the user which cages to analyze, which dates and times to include, and how to bin the data by time. If you are interested in the data from 10 cages, binned at one-hour increments, over 5 days, you can simply input those numbers. Once everything is set, you can then save the data to a .csv, which will have the same base filename as the input data. The .csv will contain columns for:

date and time
cage id
number of feeding bouts
average bout duration (in seconds)
total eaten (in grams)
number of meals
average meal duration (in seconds)
average meal size (in grams)
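
To give a feel for the binning step, here is a minimal sketch that groups feeding bouts by cage and time bin and computes a few of the columns above. It is not the GUI's actual code: the input layout (cage id, start time, duration, grams) and the function name are assumptions, and meals are left out because their definition is not covered here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bin_bouts(bouts, bin_hours=1):
    """Summarise feeding bouts per (time bin, cage).

    bouts: iterable of (cage_id, start_datetime, duration_s, grams).
    bin_hours: bin width in hours; assumed to divide 24 evenly.
    """
    bins = defaultdict(list)
    for cage, start, duration_s, grams in bouts:
        # Floor the start time to the beginning of its bin
        floored = start.replace(minute=0, second=0, microsecond=0)
        floored -= timedelta(hours=floored.hour % bin_hours)
        bins[(floored, cage)].append((duration_s, grams))

    rows = []
    for (bin_start, cage), items in sorted(bins.items()):
        n_bouts = len(items)
        rows.append({
            "datetime": bin_start,
            "cage": cage,
            "n_bouts": n_bouts,
            "mean_bout_s": sum(d for d, _ in items) / n_bouts,
            "total_g": round(sum(g for _, g in items), 2),
        })
    return rows

# Made-up example data: (cage id, start time, duration in seconds, grams eaten)
bouts = [
    (1, datetime(2016, 3, 31, 10, 5), 30, 0.10),
    (1, datetime(2016, 3, 31, 10, 40), 60, 0.25),
    (1, datetime(2016, 3, 31, 11, 10), 20, 0.05),
    (2, datetime(2016, 3, 31, 10, 15), 45, 0.20),
]
rows = bin_bouts(bouts)
```

Each dictionary in `rows` corresponds to one line of the output file, so writing them out with `csv.DictWriter` would be a natural next step.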

Left: The GUI. You can choose a range of cages, dates, times, and more. Right: Output CSV for the file given the parameters on the left. There is information for total food eaten, number of bouts of eating, number of meals, and duration for each of those. If a cage or time does not have information, the row will be blank.

If you use BioDAQs to measure feeding, and are similarly frustrated with the software, you can give this GUI a try! You only need to download two Python files. To run the GUI, you can use either Python 2.7 or 3+ (the Anaconda install should have all the relevant modules). Just open a command prompt, and type:


I suggest comparing the output of the GUI to some pre-analyzed data, so you can verify that it works. If you find this helpful, let me know!

Monday, March 28, 2016

A simple GUI for analyzing thermal images

One of the grad students in the lab has started a project on thermoregulation, and he measures mouse tail temperature using an infrared camera from FLIR. FLIR has an analysis tool for its cameras which works OK, but is not really designed for analyzing hundreds of images. To save him some time, I made a simple GUI for analyzing thermal images. In this post I'm going to outline the design of the GUI, and how to use it, in case anyone else needs to analyze lots of thermal images.

Creating temperature images

The FLIR camera stores images in pseudocolor jpegs that look like this:

However, these jpegs do not contain the actual temperature data. To figure out where the temperature data was, I consulted this thread from two years ago about FLIR images. I learned that the data is actually contained in the EXIF for the jpeg, is only 80x60 pixels (compared to the jpeg's 320x240), contains intensity data (not temperature), and is stored with the wrong byte order (endianness). Luckily, the thread contained enough details that I was able to figure out how to extract the image from the EXIF using exiftool, and was able to switch the byte order using ImageMagick.
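
To see why byte order matters, here is a small sketch that reads the same bytes as 16-bit pixels both ways. It only illustrates the concept; I did the actual swap with ImageMagick, and the bytes and function name below are made up.

```python
import struct

def read_uint16(data, byteorder):
    """Interpret raw bytes as unsigned 16-bit pixels.

    byteorder is '<' for little-endian or '>' for big-endian.
    """
    count = len(data) // 2
    return list(struct.unpack(f"{byteorder}{count}H", data))

# Four made-up bytes, i.e. two 16-bit pixels
raw_bytes = bytes([0x2E, 0xE0, 0x2F, 0x44])

as_big = read_uint16(raw_bytes, ">")     # first byte is the high byte
as_little = read_uint16(raw_bytes, "<")  # first byte is the low byte
```

Read one way, neighbouring pixels have similar values; read the other way, they jump around wildly, which is usually the giveaway that the byte order is wrong.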

Once I had the imaging data, I then needed to convert it to temperature. Here the thread came in handy again, specifically this post which outlined how to convert radiance to temperature. All of the constants for the equation are also stored in the EXIF of the jpeg, which allowed me to calculate the temperature for each point.
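
As a rough sketch of that conversion, the commonly cited simplified form is T = B / ln(R1 / (R2 * (S + O)) + F), with T in kelvin and S the raw sensor value. I am assuming that simplified form here (it ignores corrections for emissivity and reflected temperature), and the constants below are made-up placeholders; the real ones come from each image's EXIF (tags such as PlanckR1 and PlanckB).

```python
import math

def raw_to_celsius(raw, r1, r2, b, f, o):
    """Convert a raw sensor value to degrees Celsius (simplified inverse Planck form)."""
    kelvin = b / math.log(r1 / (r2 * (raw + o)) + f)
    return kelvin - 273.15

def celsius_to_raw(celsius, r1, r2, b, f, o):
    """Inverse of raw_to_celsius, handy for sanity-checking the constants."""
    kelvin = celsius + 273.15
    return r1 / (r2 * (math.exp(b / kelvin) - f)) - o

# Placeholder constants for illustration only; read the real ones from the EXIF
constants = dict(r1=16556.0, r2=0.0462, b=1428.0, f=1.0, o=-207.0)

tail_temperature = raw_to_celsius(3500, **constants)
```

Applying `raw_to_celsius` to every pixel of the 80x60 raw image gives the temperature image.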


Once I was able to calculate true-temperature images, it was time to make the GUI! Previously I've made GUIs using QT Designer, but I found an outline of a tkinter script that records the x,y coordinates whenever someone clicks on an image, so I decided to modify that instead. For the image to click on, I decided to go with a grayscale version of the pseudocolor jpeg, as it looks a lot nicer.

To use the GUI, you need to install Python 3.4, exiftool and ImageMagick. Then to run the GUI, open a command prompt, go to the directory with the images, and execute:


Grayscale version you can click on! The GUI uses the .jpg for display, but loads the temperature data in the background.

Once it opens, simply click on the pixel you want the temperature of, and the GUI will output the temperature of that pixel on the command line. If you are analyzing a bunch of images, you can hit space to go to the next image, or 'q' to quit. When you are done, the GUI will save a .csv containing the names of each image, and the temperature for that image.
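
Because the displayed jpeg is 320x240 but the temperature data is only 80x60, a click has to be mapped from display coordinates to the smaller grid. Here is a minimal sketch of one way to do that lookup; the function name is hypothetical and the script may handle this differently.

```python
def temperature_at_click(x, y, temps, display_size=(320, 240)):
    """Look up the temperature under a click on the displayed image.

    x, y: click position in display pixels (origin at top-left).
    temps: 2-D list of temperatures, indexed as temps[row][col].
    """
    rows, cols = len(temps), len(temps[0])
    # Scale display coordinates down to the temperature grid, clamping at the edges
    col = min(x * cols // display_size[0], cols - 1)
    row = min(y * rows // display_size[1], rows - 1)
    return temps[row][col]

# Fake 60-row by 80-column "temperature" grid where each value encodes its position
temps = [[row * 100 + col for col in range(80)] for row in range(60)]
clicked = temperature_at_click(4, 8, temps)  # falls in column 1, row 2
```

With these sizes, each temperature pixel covers a 4x4 block of display pixels, so nearby clicks often return the same value.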

Exciting screenshot of a CSV! Temperatures are in Celsius.
If this GUI sounds interesting to you, you can download the script. That folder contains: 1) a README.txt explaining how to install everything, and instructions on how to run the GUI; and 2) the script for the GUI. Everything was written in Python 3. If you are using the script on Windows you may have to install packages for tkinter and image. If you have any problems, please contact me, as I helped someone else in the lab set it up.

Saturday, January 16, 2016

Introducing the mechanisms Twitter bot

When describing what we know about the world, scientists often have to state what we don't know. Rather than simply stating, "We don't know how X works," scientists (and especially biologists) have come up with the beautiful syntax, "The mechanisms underlying X are not yet understood." Why use five syllables when you could use 14! In a previous blog post, I explored the history of this syntax, and found its basic form dates to 1950. Later, I invented the Mechanisms rating system, the number of sentences until the mechanisms syntax is used. Since I have grown more interested in data science, I decided to look at the mechanisms syntax from a medium data perspective.

So today, I'm going to present the highlights of 25,000 abstracts that contain mechanisms syntax. Then I will introduce a twitter bot that will once a day tweet out a new mechanism that we don't understand.

Some quick stats about mechanisms

I started by querying Pubmed for 100,000 abstracts containing both a mechanism word like "mechanism" or "pathway", and a clarity word like "unknown." I then filtered the abstracts for those which used a mechanism and clarity word in the same sentence, yielding a final pool of 25,000 abstracts. If you want to see a bunch of these sentences, you can visit the Jupyter notebook for this project.
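
The same-sentence filter can be sketched in a few lines. This is a simplified stand-in for what is in the notebook: the sentence splitting is naive, and apart from "mechanism", "pathway" and "unknown" the word lists are my assumptions.

```python
import re

MECHANISM_WORDS = ("mechanism", "pathway")
# Only "unknown" is taken from the post; the other clarity words are assumptions
CLARITY_WORDS = ("unknown", "unclear", "not understood")

def mechanism_sentences(abstract):
    """Return sentences containing both a mechanism word and a clarity word."""
    # Naive split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    matches = []
    for sentence in sentences:
        lowered = sentence.lower()
        has_mechanism = any(word in lowered for word in MECHANISM_WORDS)
        has_clarity = any(word in lowered for word in CLARITY_WORDS)
        if has_mechanism and has_clarity:
            matches.append(sentence)
    return matches

abstract = ("X regulates Y through a known pathway. "
            "However, the mechanisms underlying Z remain unknown. "
            "The long-term outcome is unknown.")
found = mechanism_sentences(abstract)
```

Only the middle sentence survives: the first has a mechanism word but no clarity word, and the last has the reverse.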

The shortest sentence in this corpus is admirably concise: "The mechanisms are unclear."

The longest sentence highlights the beauty of the syntax, how you can write dozens of words about what we do know, then switch it up and talk about what we don't know:
Although an increasing number of studies have identified misregulated miRNAs in the neurodegenerative diseases (NDDs) Alzheimer's disease, Parkinson's disease, Huntington's disease, and amyotrophic lateral sclerosis, which suggests that alterations in the miRNA regulatory pathway could contribute to disease pathogenesis, the molecular mechanisms underlying the pathological implications of misregulated miRNA expression and the regulation of the key genes involved in NDDs remain largely unknown. PMID: 26663180
If you didn't feel like reading that, I'll summarize: micro RNAs are important for brain diseases, but we don't know how they work.

For some summary stats of the 25,000 sentences:
  • 19,500 ended with some variant of "unknown"
  • 8,900 included "however"
  • 2,800 used "although"
  • 600 used "while"

Mechanisms Twitter bot

Since I can never get enough of these sentences, I made a twitter bot to find them for me. Each day, the bot queries Pubmed for new abstracts using the syntax, and tweets one of the sentences with a link.
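
For the curious, a daily query like this can be built against NCBI's E-utilities search endpoint. The sketch below only constructs the URL and is not the bot's actual code; the search term and function name are placeholders.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_daily_query(term, max_results=20):
    """Build an esearch URL for PubMed records added in the last day."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": 1,        # only records from the last 1 day...
        "datetype": "edat",  # ...measured by Entrez date
        "retmax": max_results,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)

# Placeholder search term; the real query would use the full word lists
url = build_daily_query("mechanisms AND unknown")
```

Fetching that URL returns a list of PubMed IDs, whose abstracts can then be downloaded and filtered by sentence.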

While this was a fun side project, when you look at thousands of these sentences, they form a sort of catalogue of everything we don't know. Biofilms, leukocytes, 6-gingerol... For many of these papers, I would have to spend some time to figure out what we do and don't know.

So if you want a fun daily reminder of everything we don't know, please follow.

Monday, December 28, 2015

Exploring random forest hyper-parameters using a League of Legends dataset

Over the last few blog posts, I have used random forests to investigate data from the game League of Legends. In this last post, I will explore model optimization. Specifically, I will look at how hyper-parameters like forest size and node size can influence classification accuracy, show that dimensionality reduction doesn't help random forests, and compare random forest performance to Naive Bayes. For details of this post, please look at this Jupyter notebook.


As a refresher, League of Legends is an online multiplayer game where teams try to destroy each other's base. The aim of this project is to predict the eventual winner of a game at early timepoints, using game conditions; for example, I would like to be able to say the Blue Team with a 10,000 gold lead has an X% chance of winning, and determine X by machine learning. I gathered data from Riot Games's API, and created pandas dataframes containing information from 30,000 games at 5 minute intervals. For examples, you can see previous posts. In this post, as mentioned above, I will explore how I optimized my random forest model.

Forest size

The basic task my random forest is trying to do is predict the winner of a game of League of Legends, a classification task. Random forests work by generating a large number of decision trees, and averaging the results across trees. Thus one of the most obvious random forest parameters to optimize is the number of trees in the forest. To look at this, I ran the prediction algorithm many times, with a different number of trees for each run.

As you increase the number of trees in the forest, the accuracy improves, until it plateaus around 25 trees. I am using around a dozen features in my dataset, which makes me hypothesize that your forest should have a number of trees equal to twice the number of features. It would be interesting to look at datasets with more features to see how they perform, and see how general this is.
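
A sweep like this takes only a few lines in scikit-learn. The sketch below uses a small synthetic dataset from `make_classification` as a stand-in for the League of Legends data, so the numbers it produces say nothing about my results; the tree counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real dataset: 12 features, binary outcome
X, y = make_classification(n_samples=600, n_features=12,
                           n_informative=6, random_state=0)

tree_counts = [1, 5, 10, 25, 50]
accuracy_by_size = {}
for n_trees in tree_counts:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # Mean accuracy over 3 cross-validation folds
    accuracy_by_size[n_trees] = cross_val_score(forest, X, y, cv=3).mean()
```

Plotting `accuracy_by_size` against the number of trees gives the kind of curve described above.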

Node size

The software package I am using for building random forests, scikit-learn, by default creates decision trees where each leaf is pure (that is, each group contains only wins or losses, and no mixture of wins and losses). This can lead to overfitting of individual trees, and by extension the random forest. However, you can set a parameter in the random forest to increase the minimum leaf size, or the minimum size of nodes for splitting. To see how this influenced prediction accuracy, I ran the prediction algorithm over a range of minimum node sizes.

The model accuracy goes up as the minimum sample size increases, plateauing around a 200 sample minimum. The model then maintains its accuracy for larger minimums, before decreasing once the minimum size is 5,000, or approximately 15% of the data. Once the minimum is that large, the trees are restricted in their depth, and cannot make enough splits to separate the data.

Rather than making graphs like this for every parameter, I ran a grid search over tree depth, leaf size, and node size. The best parameters for my dataset, with 30,000 games and a dozen features, were a maximum depth of 10 levels, a minimum leaf size of 10, and a minimum split size of 1000. Optimizing these parameters yielded a 2-5% increase in accuracy over the default parameters (over a range of timepoints).
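
In scikit-learn, a grid search over these three parameters looks roughly like the sketch below. It again runs on a small synthetic dataset rather than the game data, and the grid values are scaled-down placeholders, not the ones I searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=600, n_features=12,
                           n_informative=6, random_state=0)

# Placeholder grid, kept small so the search runs quickly
param_grid = {
    "max_depth": [5, 10, None],      # None lets trees grow until leaves are pure
    "min_samples_leaf": [1, 10],     # minimum leaf size
    "min_samples_split": [2, 50],    # minimum node size for splitting
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_
```

`search.best_score_` holds the cross-validated accuracy for the winning combination, which is the number to compare against the defaults.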

Dimensionality reduction

There is a lot of collinearity in my dataset, which can be a problem for certain machine learning algorithms; theoretically, random forests are resilient to collinearity. To see this for myself, I performed dimensionality reduction, specifically PCA, on my features, and then ran the random forest algorithm again. Here I am presenting the data slightly differently, in terms of prediction accuracy at specific times within the game.
The PCA model is actually a little worse! In the past I have used PCA on data before performing regression, which improved regression performance by 5%. It's good to know that random forests work as advertised, and don't require dimensionality reduction beforehand.
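
The comparison can be set up as a pipeline, so that PCA is fit only on the training folds. This sketch uses synthetic data with deliberately redundant features as a stand-in for my collinear dataset; the number of components is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with redundant (collinear) features
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           n_redundant=6, random_state=0)

plain_forest = RandomForestClassifier(n_estimators=25, random_state=0)
pca_forest = make_pipeline(
    StandardScaler(),        # PCA is sensitive to feature scale
    PCA(n_components=5),     # arbitrary number of components
    RandomForestClassifier(n_estimators=25, random_state=0),
)

plain_accuracy = cross_val_score(plain_forest, X, y, cv=3).mean()
pca_accuracy = cross_val_score(pca_forest, X, y, cv=3).mean()
```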

Naïve Bayes

Part of the reason I did this project was to gain experience with random forests, and compare their performance with other algorithms. Naïve Bayes is a decent, fast default algorithm which is often used for benchmarking. I hadn't used it previously, as there is a little bit of manipulation necessary to combine categorical and continuous features in Naïve Bayes (like 5 lines of code!). As a final experiment, I wanted to see how Naïve Bayes compared to Random Forests.
Naïve Bayes performs almost as well! And runs a whole lot faster! What is probably happening here is that the dataset is just not deep enough for the strength of random forests to shine.
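
For reference, one common way to combine continuous and categorical features in Naïve Bayes is to fit one model per feature type and add their log posteriors, subtracting the class prior once so it is not counted twice. The sketch below shows that approach on made-up data; the feature names are hypothetical and this is not necessarily the manipulation I used.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.RandomState(0)
n = 400
y = rng.randint(0, 2, size=n)  # 1 = win, 0 = loss

# Hypothetical features: gold lead (continuous) and first dragon taken (binary)
gold_lead = rng.normal(loc=np.where(y == 1, 1500.0, -1500.0),
                       scale=2000.0).reshape(-1, 1)
first_dragon = (rng.rand(n) < np.where(y == 1, 0.7, 0.3)).astype(int).reshape(-1, 1)

gnb = GaussianNB().fit(gold_lead, y)
bnb = BernoulliNB().fit(first_dragon, y)

# Each log posterior already includes the class prior, so subtract one copy
log_joint = (gnb.predict_log_proba(gold_lead)
             + bnb.predict_log_proba(first_dragon)
             - np.log(gnb.class_prior_))
predictions = gnb.classes_[np.argmax(log_joint, axis=1)]

# Accuracy on the training data itself, just to show the pieces fit together
accuracy = (predictions == y).mean()
```

This works because Naïve Bayes assumes features are independent given the class, so the per-feature likelihoods simply multiply.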