Thursday, December 28, 2017

Paper trail day trip: Lac Lison

(Blogger note: I am now a Data Scientist in Microsoft Support Engineering, working with natural language processing (NLP). I have been reading NLP papers for work, and rather than just post summaries to our Teams (Microsoft Slack) channel, I figured I could summarize them here for my future self, and others.)

I use word embeddings regularly at work, specifically word2vec. Word2vec creates vector representations of words by analyzing the contexts they appear in. These representations are useful because they let you calculate the similarity between words by calculating the distance between their vectors. The word2vec model has many hyperparameters that can affect its performance. Today's paper, "Redefining Context Window for Word Embedding Models: An Experimental Study," by Lison and Kutuzov, is basically a grid search over some of these hyperparameters, using Google News articles and teleplay scripts as corpora. For evaluation, they used a measure of semantic and lexical similarity (SimLex999), and a measure of analogies.
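To make "similarity as distance" concrete, here is a minimal sketch using cosine similarity, a common choice for comparing word vectors. The three-dimensional vectors are invented for illustration; real word2vec vectors have hundreds of dimensions, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, made up for this example only.
toy_vectors = {
    "excel": [0.9, 0.1, 0.3],
    "spreadsheet": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}

def most_similar(word, vectors):
    """Rank every other word by its cosine similarity to `word`."""
    scores = {other: cosine_similarity(vectors[word], vec)
              for other, vec in vectors.items() if other != word}
    return sorted(scores, key=scores.get, reverse=True)

print(most_similar("excel", toy_vectors))
```

With these toy numbers, "spreadsheet" ranks above "banana" for "excel", which is the kind of behaviour the similarity benchmarks try to measure.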

Here is what they found for each hyperparameter:

Context window:

The "context" for a word is the surrounding words. This hyperparameter determines how many words are in the context. E.g. if the window is 1, then only words adjacent to a target word are considered; for a window of 5, the five words before and after a target word are in context.
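As a small sketch of what the window hyperparameter does (the function name and example sentence are made up for illustration, not taken from the paper):

```python
def context_words(tokens, index, window):
    """Return the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

tokens = "how do I change date formats in Excel".split()
target = tokens.index("date")

print(context_words(tokens, target, window=1))  # adjacent words only
print(context_words(tokens, target, window=5))  # up to five words on each side
```

Near the edges of the text the window is simply truncated, so the wide window here returns fewer than ten words.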

In the past, it has been found that narrow windows improve the functional (part of speech) and synonymic qualities of an embedding; and wide windows improve topicality or analogies. This makes intuitive sense. For example, consider a narrow context like "how ____ Excel." The words that make sense in this context are mostly verbs, but the verbs could have widely different meanings. In contrast, consider a wider context like "how do I use the ____ in Excel to change date formats." Here, the missing word could be a verb or feature, but the topic is likely related to datetimes in Excel.

This paper performed a grid search over context windows of 1, 2, 5, and 10 words. Their results mirror the conventional wisdom: as window size increased, performance on semantic similarity (SimLex999) decreased, and analogy performance increased.

A. Performance on SimLex goes down as window size increases (Google news corpus).
B. Performance on analogies increases with window size (Subtitles corpus)

Window position:

In standard word embeddings, the context window is symmetric around a target word, e.g. five words before and after it. Lison and Kutuzov tried using asymmetric windows that cover only the words to the left or only the words to the right of a target word. In all hyperparameter combinations and corpora, the left window was worse than the symmetric one; however, in some combinations, the right window was as good as the symmetric one. This is interesting from a linguistic perspective, but given that symmetric windows work the best, I'm not sure this is an actionable insight. It does make me wonder how asymmetric windows would work in other languages with different word order.
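One way to picture the three window positions is to give the left and right sides their own sizes. This is only an illustrative sketch; `positioned_context` and the example sentence are hypothetical.

```python
def positioned_context(tokens, index, left, right):
    """Context of tokens[index] using `left` words before and `right` words after."""
    before = tokens[max(0, index - left):index]
    after = tokens[index + 1:index + 1 + right]
    return before + after

tokens = "the mouse ate the food pellet quickly".split()
i = tokens.index("food")

symmetric = positioned_context(tokens, i, left=2, right=2)
left_only = positioned_context(tokens, i, left=2, right=0)
right_only = positioned_context(tokens, i, left=0, right=2)
print(symmetric, left_only, right_only)
```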

In the subtitles corpus using the functional metric, right and symmetric windows performed similarly, while left performed worse.

Cross-sentential embeddings:

Cross-sentential is a fancy word for letting contexts cross sentence boundaries. To do this, you can put an entire document through the model rather than chopping it into sentences beforehand. The Google News corpus had longer sentences than the subtitles corpus (21 words to 7). In the Google News corpus there was almost no difference between split-sentential and cross-sentential embeddings (perhaps due to the longer sentence length). In the subtitles corpus, however, using cross-sentential embeddings decreased functional scores and increased analogy performance. This was especially pronounced for wide windows in the cross-sentential context (where the window was wider than a single sentence).
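The difference between the two settings can be sketched in a few lines. The helper names and the two short "subtitle" sentences are invented for illustration.

```python
def contexts(tokens, window):
    """Yield (target, context words) for every position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def document_contexts(sentences, window, cross_sentential):
    """Build contexts either per sentence or across the whole document."""
    if cross_sentential:
        # Flatten the document so windows can reach into neighbouring sentences.
        flat = [tok for sent in sentences for tok in sent]
        return list(contexts(flat, window))
    pairs = []
    for sent in sentences:
        pairs.extend(contexts(sent, window))
    return pairs

doc = [["where", "are", "you"], ["at", "the", "office"]]
split = document_contexts(doc, window=2, cross_sentential=False)
cross = document_contexts(doc, window=2, cross_sentential=True)
print(split[2])
print(cross[2])
```

With three-word sentences and a window of 2, the context of "you" gains words from the next sentence only in the cross-sentential setting, which is consistent with the effect showing up mainly in the short-sentence corpus.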

Performance of embeddings using subtitle corpus.
A. Cross-sentential embeddings reduce functional performance
B. Cross-sentential embeddings increase topic performance
I think the takeaway here is that cross-sentential embeddings can be useful for specific goals: if you are more concerned with topicality and have short documents, they can improve performance.

Window weighting:

Most embedding systems weight near words more than far words, either linearly (word2vec) or harmonically (GloVe). They compared linear and square root weightings, and found no difference.
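For a feel of how these schemes differ, here is a rough sketch of the weight given to a context word at distance d inside a window of size L. The linear and harmonic forms are the usual descriptions of word2vec and GloVe; the square-root form (the square root of the linear weight) is an assumption made for illustration and may not match the paper's exact definition.

```python
import math

def linear_weight(distance, window):
    """Weight falls off in equal steps from 1 (adjacent) to 1/window (farthest)."""
    return (window - distance + 1) / window

def harmonic_weight(distance):
    """Weight is the reciprocal of the distance."""
    return 1 / distance

def sqrt_weight(distance, window):
    """Assumed form: square root of the linear weight, so it decays more slowly."""
    return math.sqrt((window - distance + 1) / window)

for d in range(1, 6):
    print(d, linear_weight(d, 5), harmonic_weight(d), round(sqrt_weight(d, 5), 3))
```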

Stop word removal:

Removing stop words improved analogy performance in both corpora without reducing semantic performance. It seems this standard procedure is useful.

Stop words removal   SimLex999   Analogies
OS no removal        0.41        0.34
OS with removal      0.42        0.43
GW no removal        0.44        0.64
GW with removal      0.44        0.68
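Stop word removal itself is a one-line filter applied before training. The stop word list below is a tiny hypothetical one; real lists (and the one used in the paper) are much longer.

```python
# Tiny hypothetical stop word list, for illustration only.
STOP_WORDS = {"how", "do", "i", "the", "in", "to", "a", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "how do I change the date formats in Excel".split()
print(remove_stop_words(tokens))
```

Because the filter runs before the context windows are built, the same window size then reaches further across the remaining content words.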


I think their grid search yielded four insights:
  1. There is a tradeoff in context window width between functional and topical similarity
  2. For corpora with short sentences, cross-sentential embeddings also trade off functional against topical similarity
  3. Stop word removal generally helps embeddings
  4. In English, the right context is more important than the left context

Sunday, January 1, 2017

Using KDTrees in Apache Spark

OR: The best Coffee Shop in Hong Kong to catch Pokemon

I work with spatial data all the time, and one of the most common things I do with spatial data is find the nearest locations between two sets of objects. For example, in the context of Pokemon Go, you might ask, "what is the nearest Pokestop to a given Pokemon?" The standard way to do this is to use a data structure called a KDTree.
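To show what a KDTree does without any dependencies, here is a minimal pure-Python sketch of a 2-d tree with a nearest-neighbour query. It is not the scipy implementation that the tutorial uses; the function names and coordinates are made up, and it treats coordinates as flat x, y values rather than handling latitude and longitude properly.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a 2-d tree; each node is (point, left, right)."""
    if not points:
        return None
    axis = depth % 2  # alternate between splitting on x and on y
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid],
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Return (distance, point) for the tree point closest to target."""
    if node is None:
        return best
    point, left, right = node
    dist = math.hypot(point[0] - target[0], point[1] - target[1])
    if best is None or dist < best[0]:
        best = (dist, point)
    axis = depth % 2
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Only visit the far side if the splitting line is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best

# Hypothetical Pokestop coordinates and one Pokemon location.
pokestops = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0), (2.0, 8.0)]
tree = build_kdtree(pokestops)
print(nearest(tree, (9.0, 1.0)))
```

The pruning step is what makes the tree worthwhile: most branches are skipped, so a query does not have to compare against every point the way a brute-force search does.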

In the past six months, I have started using Apache Spark, and have quickly grown to love it. However, I haven't found any good tutorials on how to use KDTrees in Spark. To fill the void, I have written a short tutorial on how to use scipy KDTrees in Spark. The tutorial covers how to load Pokemon location information from Hong Kong, why KDTrees are great, how to create a KDTree of coffee shops in Hong Kong, and the code to combine them using Spark. I wrote the tutorial as a Jupyter notebook, but haven't figured out how to embed those in Blogger, so head over to GitHub for a gander.

If you want a sneak preview, this is how I define the udf which does the query:
from functools import partial
from pyspark.sql import functions as F, types as T
coffee_udf = F.udf( partial(query_kdtree, cur_tree = coffee_tree_broadcast),
                    T.ArrayType( T.IntegerType() ) )

Thursday, March 31, 2016

A simple GUI for analyzing BioDAQ data

The Palmiter lab often monitors food and water intake in response to a variety of stimuli. To quantify these measurements, we house mice in BioDAQ chambers, which will record how much a mouse eats or drinks down to 0.01g. While the chambers are nice, the software that comes with them is terrible. In addition to being slow, it outputs data for each cage separately, which means you get to enjoy combining Excel files. After fighting the software, I decided I could do better, and made a small GUI.

Screenshot of the BioDAQ software. Note the Windows-95 era aesthetic. You can record up to 32 scales at a time. To get a single scale, you have to unclick the other 31! Data for each cage is saved individually.


The goal of the GUI was to be able to analyze many cages over multiple days, and output a single file containing the data for further analysis.

The GUI starts by loading a .tab file which contains the feeding data. The code for this was actually easy, as the data is just a tab-delimited text file. (It takes a little longer than I expected, a few seconds, due to datetime parsing.)

Once the data is loaded, the GUI asks the user which cages to analyze, which dates and times to include, and how to bin the data by time. If you are interested in the data from 10 cages, binned at one-hour increments, over 5 days, you can simply input those numbers. Once everything is set, you can then save the data to a .csv, which will have the same base filename as the input data. The .csv will contain columns for:

date and time
cage id
number of feeding bouts
average bout duration (in seconds)
total eaten (in grams)
number of meals
average meal duration (in seconds)
average meal size (in grams)
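The binning step can be sketched roughly as follows. This is not the GUI's actual code: the tuple layout, function name, and numbers are hypothetical, and it only covers the bout columns (meals are left out because their definition is not given here).

```python
from collections import defaultdict
from datetime import datetime

def bin_bouts(bouts, bin_hours=1):
    """Summarise feeding bouts per cage and time bin.

    `bouts` is a list of (start_time, cage_id, duration_s, grams) tuples.
    """
    bins = defaultdict(list)
    for start, cage, duration_s, grams in bouts:
        # Round the start time down to the beginning of its bin.
        bin_start = start.replace(hour=start.hour - start.hour % bin_hours,
                                  minute=0, second=0, microsecond=0)
        bins[(bin_start, cage)].append((duration_s, grams))
    rows = []
    for (bin_start, cage), items in sorted(bins.items()):
        durations = [d for d, _ in items]
        rows.append({
            "datetime": bin_start,
            "cage": cage,
            "n_bouts": len(items),
            "mean_bout_s": sum(durations) / len(durations),
            "total_g": round(sum(g for _, g in items), 2),
        })
    return rows

# Made-up example bouts for two cages.
example_bouts = [
    (datetime(2016, 3, 1, 9, 5), 1, 30, 0.12),
    (datetime(2016, 3, 1, 9, 40), 1, 50, 0.30),
    (datetime(2016, 3, 1, 10, 15), 1, 20, 0.08),
    (datetime(2016, 3, 1, 9, 20), 2, 60, 0.25),
]
rows = bin_bouts(example_bouts)
for row in rows:
    print(row)
```

Unlike the GUI, this sketch emits no row for a cage and time bin that has no bouts.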

Left: The GUI. You can choose a range of cages, dates, times, and more. Right: Output CSV for the file given the parameters on the left. There is information for total food eaten, number of bouts of eating, number of meals, and duration for each of those. If a cage or time does not have information, the row will be blank.

If you use BioDAQs to measure feeding, and are similarly frustrated with the software, you can give this GUI a try! You only need to download two Python files. To run the GUI, you can use either Python 2.7 or 3+ (the Anaconda install should have all the relevant modules). Just open a command prompt, and type:


I suggest comparing the output of the GUI to some pre-analyzed data, so you can verify that it works. If you find this helpful, let me know!

Monday, March 28, 2016

A simple GUI for analyzing thermal images

One of the grad students in the lab has started a project on thermoregulation, and he measures mouse tail temperature using an infrared camera from FLIR. FLIR has an analysis tool for its cameras which works OK, but is not really designed for analyzing hundreds of images. To save him some time, I made a simple GUI for analyzing thermal images. In this post I'm going to outline the design of the GUI, and how to use it, in case anyone else needs to analyze lots of thermal images.

Creating temperature images

The FLIR camera stores images in pseudocolor jpegs that look like this:

However, these jpegs do not contain the actual temperature data. To figure out where the temperature data was, I consulted this thread from two years ago about FLIR images. I learned that the data is actually contained in the EXIF for the jpeg, is only 80x60 pixels (compared to the jpeg's 320x240), contains intensity data (not temperature), and that the data was stored with the wrong byte order (endianness). Luckily, the thread contained enough details that I was able to figure out how to extract the image from the EXIF using exiftool, and was able to switch the byte order using ImageMagick.
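The byte-order problem is easy to demonstrate in a few lines. This sketch assumes the raw values are 16-bit unsigned integers, which is an assumption rather than something stated here, and it uses Python's array module instead of ImageMagick.

```python
import array

def swap_byte_order(raw_bytes):
    """Swap the two bytes of every 16-bit value in a buffer."""
    values = array.array("H")  # unsigned 16-bit integers (assumed pixel format)
    values.frombytes(raw_bytes)
    values.byteswap()
    return values.tobytes()

# Two made-up "pixels" whose bytes are stored in the wrong order.
stored = b"\x12\x34\xab\xcd"
fixed = swap_byte_order(stored)
print(fixed.hex())
```

Reading the same two bytes in the wrong order gives a very different number, which is why an image decoded with the wrong endianness looks like noise.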

Once I had the imaging data, I then needed to convert it to temperature. Here the thread came in handy again, specifically this post which outlined how to convert radiance to temperature. All of the constants for the equation are also stored in the EXIF of the jpeg, which allowed me to calculate the temperature for each point.
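As a rough sketch of that conversion: the formula below is the Planck-style equation commonly quoted for FLIR raw values, and it is an assumption that this is the same equation the linked post describes. The constant values are placeholders invented for the example; real ones come from each image's EXIF. It also ignores emissivity and reflected temperature.

```python
import math

def raw_to_celsius(raw, r1, r2, b, f, o):
    """Convert a raw sensor value to degrees Celsius (assumed Planck-style formula)."""
    return b / math.log(r1 / (r2 * (raw + o)) + f) - 273.15

def celsius_to_raw(temp_c, r1, r2, b, f, o):
    """Inverse of raw_to_celsius, handy for sanity-checking the constants."""
    return r1 / (r2 * (math.exp(b / (temp_c + 273.15)) - f)) - o

# Placeholder constants, NOT from a real camera.
planck = dict(r1=15000.0, r2=0.01, b=1400.0, f=1.0, o=-6000.0)

cool = raw_to_celsius(16000, **planck)
warm = raw_to_celsius(20000, **planck)
print(cool, warm)
```

A quick check on any real set of constants is that larger raw values map to higher temperatures, and that converting a temperature back to a raw value round-trips.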


Once I was able to calculate true-temperature images, it was time to make the GUI! Previously I've made GUIs using Qt Designer, but I found an outline of a tkinter script that records the x,y coordinates whenever someone clicks on an image, so I decided to modify that instead. For the image to click on, I decided to go with a grayscale version of the pseudocolor jpeg, as it looks a lot nicer.

To use the GUI, you need to install Python 3.4, exiftool and ImageMagick. Then to run the GUI, open a command prompt, go to the directory with the images, and execute:


Grayscale version you can click on! The GUI uses the .jpg for display, but loads the temperature data in the background.

Once it opens, simply click on the pixel you want the temperature of, and the GUI will output the temperature of that pixel on the command line. If you are analyzing a bunch of images, you can hit space to go to the next image, or 'q' to quit. When you are done, the GUI will save a .csv containing the names of each image, and the temperature for that image.

Exciting screenshot of a CSV! Temperatures are in Celsius.
If this GUI sounds interesting to you, you can download the script. That folder contains: 1) a README.txt explaining how to install everything, and instructions on how to run the GUI; and 2) the script for the GUI. Everything was written in Python 3. If you are using the script on Windows, you may have to install packages for tkinter and image. If you have any problems, please contact me, as I helped someone else in the lab set it up.

Saturday, January 16, 2016

Introducing the mechanisms Twitter bot

When describing what we know about the world, scientists often have to state what we don't know. Rather than simply stating, "We don't know how X works," scientists (and especially biologists) have come up with the beautiful syntax, "The mechanisms underlying X are not yet understood." Why use five syllables when you could use 14! In a previous blog post, I explored the history of this syntax, and found its basic form dates to 1950. Later, I invented the Mechanisms rating system, the number of sentences until the mechanisms syntax is used. Since I have grown more interested in data science, I decided to look at the mechanisms syntax from a medium data perspective.

So today, I'm going to present the highlights of 25,000 abstracts that contain mechanisms syntax. Then I will introduce a Twitter bot that will tweet out a new mechanism that we don't understand once a day.

Some quick stats about mechanisms

I started by querying PubMed for 100,000 abstracts containing both a mechanism word like "mechanism" or "pathway", and a clarity word like "unknown." I then filtered the abstracts for those which used a mechanism and clarity word in the same sentence, yielding a final pool of 25,000 abstracts. If you want to see a bunch of these sentences, you can visit the Jupyter notebook for this project.
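The same-sentence filter can be sketched like this. The word lists are only partly from the text above ("mechanism", "pathway" and "unknown"); the other clarity words, the naive sentence splitter, and the example abstract are assumptions made for illustration.

```python
import re

MECHANISM_WORDS = ("mechanism", "pathway")
# "unknown" is from the post; the other two entries are guesses.
CLARITY_WORDS = ("unknown", "unclear", "not yet understood")

def mechanism_sentences(abstract):
    """Return sentences containing both a mechanism word and a clarity word."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        has_mechanism = any(word in lowered for word in MECHANISM_WORDS)
        has_clarity = any(word in lowered for word in CLARITY_WORDS)
        if has_mechanism and has_clarity:
            hits.append(sentence)
    return hits

# Invented abstract for the example.
abstract = ("Biofilms are common. However, the mechanisms underlying biofilm "
            "formation remain unknown. The pathway was studied in mice. "
            "Results are unclear.")
hits = mechanism_sentences(abstract)
print(hits)
```

In this example the third and fourth sentences each contain only one kind of word, so an abstract-level search would match them together while the sentence-level filter keeps just the second sentence.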

The shortest sentence in this corpus is concise: "The mechanisms are unclear."

The longest sentence highlights the beauty of the syntax, how you can write dozens of words about what we do know, then switch it up and talk about what we don't know:
Although an increasing number of studies have identified misregulated miRNAs in the neurodegenerative diseases (NDDs) Alzheimer's disease, Parkinson's disease, Huntington's disease, and amyotrophic lateral sclerosis, which suggests that alterations in the miRNA regulatory pathway could contribute to disease pathogenesis, the molecular mechanisms underlying the pathological implications of misregulated miRNA expression and the regulation of the key genes involved in NDDs remain largely unknown. PMID: 26663180
If you didn't feel like reading that, I'll summarize: micro RNAs are important for brain diseases, but we don't know how they work.

For some summary stats of the 25,000 sentences:
  • 19,500 ended with some variant of "unknown"
  • 8,900 included "however"
  • 2,800 used "although"
  • 600 used "while"

Mechanisms Twitter bot

Since I can never get enough of these sentences, I made a Twitter bot to find them for me. Each day, the bot queries PubMed for new abstracts using the syntax, and tweets one of the sentences with a link.

While this was a fun side project, when you look at thousands of these sentences, they form a sort of catalogue of everything we don't know. Biofilms, leukocytes, 6-gingerol... For many of these papers, I would have to spend some time to figure out what we do and don't know.

So if you want a fun daily reminder of everything we don't know, please follow.