Thursday, March 31, 2016

A simple GUI for analyzing BioDAQ data

The Palmiter lab often monitors food and water intake in response to a variety of stimuli. To quantify these measurements, we house mice in BioDAQ chambers, which will record how much a mouse eats or drinks down to 0.01g. While the chambers are nice, the software that comes with them is terrible. In addition to being slow, it outputs data for each cage separately, which means you get to enjoy combining Excel files. After fighting the software, I decided I could do better, and made a small GUI.

Screenshot of the BioDAQ software. Note the Windows-95 era aesthetic. You can record up to 32 scales at a time. To get a single scale, you have to unclick the other 31! Data for each cage is saved individually.


The goal of the GUI was to be able to analyze many cages over multiple days, and output a single file containing the data for further analysis.

The GUI starts by loading a .tab file which contains the feeding data. The code for this was actually easy, as the data is just a tab-delimited text file. (It takes a little longer than I expected, a few seconds, due to datetime parsing.)

Once the data is loaded, the GUI asks the users for information about which cages to analyze, which dates, times, and how to bin the data by time. If you are interested in the data from 10 cages, binned at one hour increments, over 5 days, you can simply input those numbers. Once everything is set, you can then save the data to a .csv, which will have the same base filename as the input data. The .csv will contain columns for:

date and time
cage id
number of feeding bouts
average bout duration (in seconds)
total eaten (in grams)
number of meals
average meal duration (in seconds)
average meal size (in grams)

Left: The GUI. You can choose a range of cages, dates, times, and more. Right: Output CSV for the file given the parameters on the left. There is information for total food eaten, number of bouts of eating, number of meals, and duration for each of those. If a cage or time does not have information, the row will be blank.

If you use BioDAQs to measure feeding, and are similarly frustrated with the software, you can give this GUI a try! You only need to download two python files, and To run the GUI, you can use either python 2.7 or 3+ (the Anaconda install should have all the relevant modules). Just open a command prompt, and type:


I suggest comparing the output of the GUI to some pre-analyzed data, so you can verify that it works. If you find this helpful, let me know!

Monday, March 28, 2016

A simple GUI for analyzing thermal images

One of the grad students in the lab has started a project on thermoregulation, and he measures mouse tail temperature using an infrared camera from FLIR. FLIR has an analysis tool for its cameras which works OK, but is not really designed for analyzing hundreds of images. To save him some time, I made a simple GUI for analyzing thermal images. In this post I'm going to outline the design of the gui, and how to use it, in case anyone else needs to analyze lots of thermal images.

Creating temperature images

The FLIR camera stores images in pseudocolor jpegs that look like this:

However, these jpegs do not contain the actual temperature data. To figure out where the temperature data was, I consulted this thread from two years ago about FLIR images. I learned that the data is actually contained in the EXIF for the jpeg, is only 80x60 pixels (compared to the jpeg's 320x240), contains intensity data (not temperature), and that the data was stored with the wrong -endian. Luckily, the thread contained enough details that I was able to figure out how to extract the image from the EXIF using exiftool, and was able to switch the endian using ImageMagick.

Once I had the imaging data, I then needed to convert it to temperature. Here the thread came in handy again, specifically this post which outlined how to convert radiance to temperature. All of the constants for the equation are also stored in the EXIF of the jpeg, which allowed me to calculate the temperature for each point.


Once I was able to calculate true-temperature images, it was time to make the GUI! Previously I've made GUIs using QT Designer, but I found an outline of a tkinter script that records the x,y coordinates whenever someone clicks on an image, so I decided to modify that instead. For the image to click on, I decided to go with a grayscale version of the pseudocolor jpeg, as it looks a lot nicer.

To use the GUI, you need to install python 3.4, exiftool and ImageMagick. Then to run the gui, open a command prompt, and go to the directory with the images, then execute:


Grayscale version you can click on! The GUI uses the .jpg for display, but loads the temperature data in the background.

Once it opens, simply click on the pixel you want the temperature of, and the GUI will output the temperature of that pixel on the command line. If you are analyzing a bunch of images, you can hit space to go to the next image, or 'q' to quit. When you are done, the GUI will save a .csv containing the names of each image, and the temperature for that image.

Exciting screenshot of a CSV! Temperatures are in Celsius.
If this GUI sounds interesting to you, you can download the script. That folder contains: 1) a README.txt explaining how to install everything, and intructions on how to run the GUI; and 2) the script for the GUI, Everything was written in python 3. If you are using the script on Windows you may have to install packages for tkinter and image. If you have any problems, please contact me, as I helped someone else in the lab set it up.

Saturday, January 16, 2016

Introducing the mechanisms Twitter bot

When describing what we know about the world, scientists often have to state what we don't know. Rather than simply stating, "We don't know how X works," scientists (and especially biologists) have come up with the beautiful syntax, "The mechanisms underlying X are not yet understood." Why use five syllables when you could use 14! In a previous blog post, I explored the history of this syntax, and found its basic form dates to 1950. Later, I invented the Mechanisms rating system, the number of sentences until the mechanisms syntax is used. Since I have grown more interested in data science, I decided to look at the mechanisms syntax from a medium data perspective.

So today, I'm going to present the highlights of 25,000 abstracts that contain mechanisms syntax. Then I will introduce a twitter bot that will once a day tweet out a new mechanism that we don't understand.

Some quick stats about mechanisms

I started by querying Pubmed for 100,000 abstracts containing both a mechanism word like "mechanism" or "pathway", and a clarity word like "unknown." I then filtered the abstracts for those which used a mechanism and clarity word in the same sentence, yielding a final pool of 25,000 abstracts. If you want to see a bunch of these sentences, you can visit the Jupyter notebook for this project.

The shortest sentence in this corpus is concise, "The mechanisms are unclear."

The longest sentence highlights the beauty of the syntax, how you can write dozens of words about what we do know, then switch it up and talk about what we don't know:
Although an increasing number of studies have identified misregulated miRNAs in the neurodegenerative diseases (NDDs) Alzheimer's disease, Parkinson's disease, Huntington's disease, and amyotrophic lateral sclerosis, which suggests that alterations in the miRNA regulatory pathway could contribute to disease pathogenesis, the molecular mechanisms underlying the pathological implications of misregulated miRNA expression and the regulation of the key genes involved in NDDs remain largely unknown. PMID: 26663180
If you didn't feel like reading that, I'll summarize: micro RNAs are important for brain diseases, but we don't know how they work.

For some summary stats of the 25,000 sentences:
  • 19,500 ended with some variant of "unknown"
  • 8,900 included "however"
  • 2,800 used "although"
  • 600 used "while"

Mechanisms Twitter bot

Since I can never get enough of these sentences, I made a twitter bot to find them for me. Each day, the bot queries Pubmed for new abstracts using the syntax, and tweets it with a link.

While this was a fun side project, when you look at thousands of these sentences, they form a sort of catalogue of everything we don't know. Biofilms, leukocytes, 6-gingerol... For many of these papers, I would have to spend some time to figure out what we do and don't know.

So if you want a fun daily reminder of everything we don't know, please follow.

Monday, December 28, 2015

Exploring random forest hyper-parameters using a League of Legends dataset

Over the last few blog posts, I have used random forests to investigate data from the game League of Legends. In this last post, I will explore model optimization. Specifically I will look at how hyper-parameters like forest size, and node size can influence classification accuracy, show that dimensionality reduction doesn't help random forests, and compare random forest performance to Naive Bayes. For details of this post, please look at this Jupyter notebook.


As a refresher, League of Legends is an online multiplayer game where teams try to destroy each others' base. The aim of this project is to predict the eventual winner of a game at early timepoints, using game conditions; for example, I would like to be able to say the Blue Team with a 10,000 gold lead has an X% chance of winning, and determine X by machine learning. I gathered data from Riot Games's API, and created pandas dataframes containing information from 30,000 games at 5 minute intervals. For examples, you can see previous posts. In this post, as mentioned above, I will explore how I optimized my random forest model.

Forest size

The basic task my random forest is trying to do is predict the winner of a game of League of Legends, a classification task. Random forests work by generating a large number of decision trees, and averaging the results across trees. Thus one of the most obvious random forest parameters to optimize is the number of trees in the forest. To look at this, I ran the prediction algorithm many times, with different number of trees for each run.

As you increase the number of trees in the forest, the accuracy improves, until it plateaus around 25 trees. I am using around a dozen features in my dataset, which makes me hypothesize that your forest should have a number of trees equal to twice the number of features. It would be interesting to look at datasets with more features to see how they perform, and see how general this is.

Node size

The software package I am using for building random forests, scikit-learn, by default creates decision trees where each leaf is pure (that is, each group contains only wins or losses, and no mixture of wins and losses). This can lead to overfitting of individual trees, and by extension the random forest. However, you can set a parameter in the random forest to increase the minimum leaf size, or the minimum size of nodes for splitting. To see how this influenced prediction accuracy, I ran the prediction algorithm over a range of minimum node sizes.

The model accuracy goes up as the minimum sample size increases, plateauing around a 200 sample minimum. The model then maintains its accuracy for larger minimums, before decreasing once the minimum size is 5,000, or approximately 15% of the data. Once the minimum is that large, the trees are restricted in their depth, and cannot make enough splits to separate the data.

Rather than making graphs like this for every parameter, I ran a grid search over tree depth, leaf size, and node size. The best parameters for my dataset with 30,000 games, and a dozen features was a maximum depth of 10 levels, minimum leaf size of 10, and minimum split size of 1000. Optimizing these parameters yielded a 2-5% increase in accuracy over the default parameters (over a range of timepoints).

Dimensionality reduction

There is a lot of collinearity in my dataset, which can be a problem for certain machine learning algorithms; theoretically, random forests are resilient to collinearity. To see this for myself, I performed dimensionality reduction, specifically PCA, on my features, and then ran the random forest algorithm again. Here I am presenting the data slightly differently, in terms of prediction accuracy at specific times within the game.
The PCA model is actually a little worse! In the past I have used PCA on data before performing regression, and which improved regression performance by 5%. It's good to know that random forests work as advertised, and don't require dimentionality reduction beforehand.

Naïve Bayes

Part of the reason I did this project was to gain experience with random forests, and compare their performance with other algorithms. Naïve Bayes is a decent, fast default algorithm which is often used for benchmarking. I hadn't used it previously, as there is a little bit of manipulation necessary to combine categorical and continuous features in Naïve Bayes (like 5 lines of code!). As a final experiment, I wanted to see how Naïve Bayes compared to Random Forests.
Naïve Bayes performs almost as well! And runs a whole lot faster! What is probably happening here is that the dataset is just not deep enough for the strength of random forests to shine.

Friday, November 20, 2015

Influence of region, skill and patch on predictability in League of Legends

Last month, I used random forests to predict the winner of League of Legends games. Since then I have downloaded datasets from different regions, different ELOs, and on different patches, and checked how these factors influence predictability. This post will summarize the results, but for details of the analyses, please see this notebook.

Differences between regions

Korea is famed for its great players, and an "open mid" philosophy (in a normal game of League of Legends, you cannot surrender until 20 minutes into the game; in Korea, players accelerate this process by letting the other team win the game quickly through an "open mid"). Are we able to see those attributes in our model? As a first way to look at this, we can plot the length of games in different regions.
The EUW and NA have similar distributions of game lengths, with around 8% of games being surrendered at 20 minutes. The Korean games, on the other hand, have a few percent more games ending before 20 minutes, due to open mids, and a 14% of games ending right after 20 minutes. Since so many games are ending earlier, can we see this in the model's predictions?

Indeed we can, with games being a few percentage points more predictable for games that last 20 minutes or less. However, once the games last 25 minutes, all the regions are similarly predictable. This means that the Koreans are surrendering winnable games.

Skill and predictability

We can run the same type of analysis for skill. In the notebook I compare Bronze V games to challenger solo queue, and show that low skill games are slower, and less predictable than high skill games. Here, I will compare high skill solo queue games to master tier team ranked. The team ranked dataset is quite a bit smaller than the solo queue data (1,000 games vs 30,000), but in my tuning, I've found there is not a big difference in accuracy when the dataset is over 1,000 games.

We can start again by plotting game length.

Team ranked games are about two minutes faster than solo queue games. We can then run the prediction algorithm again on the team ranked games.

Team ranked games are much more predictable than solo queue games! (This is in fact the largest difference in predictability I have seen between different datasets). What is likely happening here is that coordinated teams are more able to capitalize on their advantages, and press them around the map. With fewer mistakes to allow comebacks, the games are faster, and more predictable. It would be interesting to further look at professional games to see how they compare.

Patches and predictability

In recent patches, Riot has made significant changes to the game, which makes it feel like the games are faster, and comebacks more difficult. I scraped challenger and master ranked games from the most recent patch, and compared them to ranked games from Season 5. We can again start with game length.

If you thought games were faster, your intuition was right, as games are around three minutes faster in preseason 2016. What about predictability?

As you'd expect given the previous relationships between speed and predictability, preseason 2016 is a few percentage easier to predict. In a previous blog post, I investigated the importance of different features on predictability, and found that dragons were not important in helping teams win. Putting on my amateur designer hat, this means that dragon is even less important, since games are less likely to last long enough for the accumulated advantages to matter.

In my next post, I'm going to go more in depth on the modeling, investigating random forest hyper-parameters, and maybe trying a few different models.