Friday, February 16, 2018

Paper trail day trip: Compositional Coulee

The distributional hypothesis states that the meaning of a word is related to the words that usually surround it. For example, "table" is usually near words like "sit," "chair," and "wood," so we can get a sense of what "table" means. For phrases, you can usually figure out a phrase's meaning by combining the meanings of its constituent words. For example, "table tennis" has something to do with hitting a ball on a flat surface. The idea that the meaning of a phrase comes from combining the meanings of its constituent words is called "compositionality." (In vector space, composing a phrase would mean something like adding or max-pooling the embeddings of its constituent words. People have developed more complicated composition functions that consider part-of-speech and more.)
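A minimal composition function just averages (or max-pools) the constituent word vectors. A sketch with made-up 3-dimensional embeddings — the vectors here are placeholders, not trained values:

```python
import numpy as np

# Toy 3-d "embeddings" -- placeholder values, not trained vectors
embeddings = {
    "table":  np.array([0.9, 0.1, 0.0]),
    "tennis": np.array([0.1, 0.8, 0.3]),
}

def compose(phrase, op=np.mean):
    """Compose a phrase embedding from its words by averaging (or max-pooling)."""
    vecs = [embeddings[w] for w in phrase.split()]
    return op(vecs, axis=0)

v = compose("table tennis")             # element-wise average of the word vectors
m = compose("table tennis", op=np.max)  # element-wise max instead
```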

However, the distributional hypothesis has exceptions, and these include "non-compositional" phrases. In the context of Microsoft, a "pivot table" has nothing to do with rotating furniture; rather, it is an Excel feature. Idioms are common examples of non-compositional phrases. Some phrases are polysemous (have multiple meanings) and can be both compositional and non-compositional: "drive home" can mean going to my house, or reiterating a point.

So how do we try to understand the meaning of phrases knowing they might be non-compositional? In the past, many people have ignored this dichotomy by either: treating all phrases as compositional, i.e. just aggregating the meaning of every individual word; or treating all phrases as non-compositional, which means treating each phrase as a single token. In a setting where your text corpus is infinite, the non-compositional method would be the most accurate; however, in the real world where phrases are sparse, treating all phrases as non-compositional creates a huge data gap. What would be ideal is to find a way to identify whether or not a phrase is compositional, and then treat it as such.

Luckily, Hashimoto and Tsuruoka have done just that! Their basic approach is to train a word embedding that has both a compositional and non-compositional model for every phrase, and figure out how to blend the two. (For an overview of word embeddings, see my previous post.)

Representing phrases as a combination of meanings

In more detail, they trained two embeddings for a phrase p, c(p) and n(p), which are the compositional and non-compositional embeddings respectively. Then they combined these two representations using a weighting function, a(p), which could range from 0 to 1 (from compositional to non-compositional). The final embedding for a phrase is then:

v(p) = a(p) * c(p) + (1-a(p)) * n(p)

a(p) is parameterized as a logistic regression for phrase p. When they trained their embedding, they performed a joint optimization for both v(p) and a(p). This means that for every phrase they got its vector representation v(p) as well as a measure of its compositionality, a(p). They trained on the British National Corpus, and English Wikipedia.
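The blend in the equation above is easy to sketch directly. Here a(p) is the logistic of a per-phrase scalar score; the score and vectors below are placeholders, whereas the real model learns everything jointly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phrase_embedding(score, c_p, n_p):
    """v(p) = a(p)*c(p) + (1 - a(p))*n(p), with a(p) = sigmoid(score)."""
    a = sigmoid(score)
    return a * c_p + (1.0 - a) * n_p

c = np.array([1.0, 0.0])  # compositional embedding (placeholder)
n = np.array([0.0, 1.0])  # non-compositional embedding (placeholder)
v = phrase_embedding(0.0, c, n)  # score 0 -> a(p) = 0.5, an even blend
```

A large positive score drives a(p) toward 1, so the phrase is treated as purely compositional; a large negative score does the opposite.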

For select phrases, they visualized how the compositionality weight changed over epochs of training.  They initialized all phrases' a(p) = 0.5; over time these weights diverged towards mostly compositional or non-compositional:
Compositionality scores change over the course of training. Phrases like "buy car" are judged to be compositional while "shed light" is not. Only one phrase is in the middle, "make noise."
They also provided some examples of phrase similarity using different compositional alphas, which illustrate how treating non-compositional phrases as compositional can yield strange results.

Similar phrases for selected phrases using compositional (a(VO) = 1), mixed (0.5), or non-compositional (0.1-0.14) embeddings. 


After getting their embedding and compositionality weights for phrases, they evaluated their embedding on verb-object pairs that had been scored by humans for their compositionality (I wonder how many of the scorers were NLP grad students). For example, "buy car" was given a high compositionality score of 6, while "bear fruit" was given a score of 1. They then compared their embedding's compositionality weights to the human scores using Spearman correlation, and found a correlation of 0.5, compared to the human-human correlation of 0.7. Pretty good!
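Comparing the learned weights to human judgments is a single scipy call. The scores below are invented examples for illustration, not the paper's data:

```python
from scipy.stats import spearmanr

# Invented example scores -- not the actual dataset
human_scores = [6, 1, 5, 2, 4]            # human compositionality ratings
model_alphas = [0.9, 0.2, 0.7, 0.4, 0.6]  # learned a(p) weights

# Spearman correlation compares the *rankings*, not the raw values
rho, pval = spearmanr(human_scores, model_alphas)
```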

As a second evaluation, they used a subject-verb-object (SVO) dataset that was rated for how similar pairs of SVOs were, in the context of polysemous verbs. For example, "run" and "operate" are similar in the phrases "people run companies" and "people operate companies," but "people move companies" is not similar to either.  Again they used Spearman correlation, and this time got 0.6 compared to 0.75.


I feel like the conclusion of this paper is a "no duh": if a phrase is non-compositional, treat it as a unique entity; if a phrase is compositional, just use the meaning of its individual words. The joint optimization was pretty cool, although I wonder about its speed. I like that word2vec can be trained in ten minutes on a few million documents.

I also wonder whether phrases can be cleanly separated into compositional and non-compositional phrases, which would make things much simpler. To know that, we would have to know the distribution of compositionality scores, which I wish they showed (computer science papers are generally mediocre when it comes to presenting data). If compositionality is bimodally distributed, it would be fairly easy to estimate in normal word2vec space: just calculate the cosine similarity between the phrase's embedding and the mean of its constituent words' embeddings! Then, if they are similar, you could break up the phrase and retrain the model.

(Postscript: I did this on a corpus of Microsoft support text. The distribution of cosine similarity between a phrase and the mean of its constituent words is normal-ish, which makes separating compositional and non-compositional phrases less clean. Maybe a good heuristic for whether to treat a phrase as a token is: keep it if it's in the bottom 25% for cosine similarity; or keep it if it's common enough to get a good embedding.)
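The heuristic from the postscript can be sketched in a few lines: compare a phrase's own embedding to the mean of its constituent words. The vectors below are placeholders standing in for trained word2vec vectors:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compositionality(phrase_vec, word_vecs):
    """Cosine similarity between a phrase's embedding and the mean of its words'."""
    return cosine(phrase_vec, np.mean(word_vecs, axis=0))

# A "compositional" phrase sits near the mean of its words...
hi = compositionality(np.array([1.0, 1.0]),
                      [np.array([1.0, 0.9]), np.array([0.9, 1.0])])
# ...while a non-compositional one does not.
lo = compositionality(np.array([-1.0, 0.2]),
                      [np.array([1.0, 0.9]), np.array([0.9, 1.0])])
```

In practice you would compute this score for every phrase in the vocabulary and look at its distribution.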

Thursday, December 28, 2017

Paper trail day trip: Lac Lison

(Blogger note: I am now a Data Scientist in Microsoft Support Engineering, working with natural language processing (NLP). I have been reading NLP papers for work, and rather than just post summaries to our Teams (Microsoft Slack) channel, I figured I could summarize them here for my future self, and others.)

I use word embeddings regularly at work, specifically word2vec. Word2vec creates vector representations of words by analyzing the contexts they appear in. These representations are useful since they let you calculate the similarity between words by calculating the distance between their vectors. The word2vec model has many hyperparameters that can affect its performance. Today's paper, "Redefining Context Window for Word Embedding Models: An Experimental Study," by Lison and Kutuzov is basically a grid search over some of these hyperparameters, using Google News articles and TV and movie subtitles as corpuses. For evaluation, they used a measure of semantic and lexical similarity (SimLex999), and a measure of analogies.

Here is what they found for each hyperparameter:

Context window:

The "context" for a word is the surrounding words. This hyperparameter determines how many words are in the context. E.g. if the window is 1, then only words adjacent to a target word are considered; for a window of 5, the five words before and after a target word are in context.
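A window of size k boils down to a function that yields (target, context) pairs; here is a minimal sketch:

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs for a symmetric window of the given size."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield (target, tokens[j])

pairs = list(context_pairs(["how", "do", "I", "use", "Excel"], 1))
```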

In the past, it has been found that narrow windows improve the functional (part of speech) and synonymic qualities of an embedding; and wide windows improve topicality or analogies. This makes intuitive sense. For example, consider a narrow context like "how ____ Excel." The words that make sense in this context are mostly verbs, but the verbs could have widely different meanings. In contrast, consider a wider context like "how do I use the ____ in Excel to change date formats." Here, the missing word could be a verb or feature, but the topic is likely related to datetimes in Excel.

This paper performed a grid search over context windows of 1, 2, 5, and 10 words. Their results mirror the conventional wisdom: as window size increased, semantic similarity decreased, and analogy performance increased.

A. Performance on SimLex goes down as window size increases (Google news corpus).
B. Performance on analogies increases with window size (Subtitles corpus)

Window position:

In standard word embeddings, the context window is symmetric around a target word, e.g. five words before and after it. Lison and Kutuzov tried using asymmetric windows to the right and left of a target word. In all hyperparameter combinations and corpora, the left window was worse than the symmetric one; however, in some combinations, the right window was as good as the symmetric one. This is interesting from a linguistic perspective, but given that symmetric windows work best, I'm not sure this is an actionable insight. It does make me wonder how asymmetric windows would work in other languages with different word orders.

In the subtitles corpus using the functional metric, right and symmetric windows performed similarly, while left performed worse.

Cross-sentential embeddings:

Cross-sentential is a fancy word for letting contexts cross sentence boundaries. To do this, you can put an entire document through the model rather than chopping it into sentences beforehand. The Google News corpus had longer sentences than the subtitles corpus (21 words to 7). In the Google News corpus, there was almost no difference between split-sentential and cross-sentential embeddings (perhaps due to the longer sentence length). In the subtitles corpus, however, functional scores decreased with cross-sentential embeddings, and analogy performance increased. This was especially pronounced for wide windows (where the window was wider than a single sentence).

Performance of embeddings using subtitle corpus.
A. Cross-sentential embeddings reduce functional performance
B. Cross-sentential embeddings increase topic performance
I think the takeaway here is that cross-sentential embeddings can be useful for specific goals: if you are more concerned with topicality and have short documents, they can improve performance.

Window weighting:

Most embedding systems weight nearby words more than distant ones, either linearly (word2vec) or harmonically (GloVe). They compared linear and square root weightings, and found no difference.
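The two weighting schemes are simple functions of the distance d between the target and context word (word2vec implements its linear falloff by sampling the window size, but the effective weights come out the same):

```python
def linear_weight(d, window):
    """word2vec-style: weight falls off linearly with distance d."""
    return (window - d + 1) / window

def harmonic_weight(d):
    """GloVe-style: weight falls off as 1/d."""
    return 1.0 / d

w_near = linear_weight(1, 5)  # adjacent word gets full weight
w_far = linear_weight(5, 5)   # edge of the window gets 1/5 weight
```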

Stop word removal:

Removing stop words improved analogy performance in both corpuses without reducing semantic performance. It seems this standard procedure is useful.

Stop words removal    SimLex999    Analogies
OS no removal            0.41         0.34
OS with removal          0.42         0.43
GW no removal            0.44         0.64
GW with removal          0.44         0.68
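Stop word removal itself is a one-liner. The stop list below is a tiny placeholder; real pipelines use a fuller list, e.g. NLTK's:

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is"}  # tiny placeholder list

def remove_stop_words(tokens):
    """Drop stop words (case-insensitively) before training the embedding."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

cleaned = remove_stop_words(["The", "meaning", "of", "a", "word"])
```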


I think their grid search yielded four insights:
  1. There is a tradeoff in context window width between functional and topical similarity
  2. For short sentence corpuses, there is also a tradeoff between functional and topical similarity by using cross-sentential embeddings
  3. Stop word removal generally helps embeddings
  4. In English, the right context is more important than the left context

Sunday, January 1, 2017

Using KDTrees in Apache Spark

OR: The best Coffee Shop in Hong Kong to catch Pokemon

I work with spatial data all the time, and one of the most common things I do with spatial data is find the nearest locations between two sets of objects. For example, in the context of Pokemon Go, you might ask, "what is the nearest Pokestop to a given Pokemon?" The standard way to do this is to use a data structure called a KDTree.
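With scipy, building the tree and querying the nearest neighbor takes two lines. The coordinates below are made-up points, not real Pokestops:

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up (lat, lon) coordinates for illustration
pokestops = np.array([[22.28, 114.15],
                      [22.30, 114.17],
                      [22.32, 114.20]])
pokemon = np.array([[22.285, 114.155]])

tree = cKDTree(pokestops)           # build once...
dist, idx = tree.query(pokemon, k=1)  # ...then query nearest stop per pokemon
```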

In the past six months, I have started using Apache Spark, and quickly grown to love it. However, I haven't found any good tutorials on how to use KDTrees in Spark. To fill the void, I have written a short tutorial on how to use scipy KDTrees in Spark. The tutorial covers how to load Pokemon location information from Hong Kong, why KDTrees are great, how to create a KDTree of coffee shops in Hong Kong, and the code to combine them using Spark. I wrote the tutorial as a Jupyter notebook, but haven't figured out how to embed those in Blogger, so head over to Github for a gander.

If you want a sneak preview, this is how I define the udf which does the query:
from functools import partial
import pyspark.sql.functions as F
import pyspark.sql.types as T

coffee_udf = F.udf(partial(query_kdtree, cur_tree=coffee_tree_broadcast),
                   T.ArrayType(T.IntegerType()))
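The udf references query_kdtree, which isn't shown in the excerpt. Here is a hypothetical sketch of what it might look like, assuming coffee_tree_broadcast is a Spark broadcast variable wrapping a scipy cKDTree (the k=2 default and stand-in broadcast class are my inventions for local testing):

```python
from scipy.spatial import cKDTree

def query_kdtree(point, cur_tree=None, k=2):
    """Hypothetical sketch: query the broadcast tree (cur_tree.value is the
    cKDTree on each worker) and return the k nearest indices as plain ints."""
    dist, idx = cur_tree.value.query(point, k=k)
    return [int(i) for i in idx]

# Stand-in for a Spark broadcast variable, so the sketch runs locally
class _FakeBroadcast:
    def __init__(self, tree):
        self.value = tree

nearest = query_kdtree([0.1, 0.1],
                       cur_tree=_FakeBroadcast(cKDTree([[0, 0], [1, 1], [2, 2]])),
                       k=2)
```

Binding the broadcast with partial (as in the udf above) keeps the tree out of the serialized closure for each row.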

Thursday, March 31, 2016

A simple GUI for analyzing BioDAQ data

The Palmiter lab often monitors food and water intake in response to a variety of stimuli. To quantify these measurements, we house mice in BioDAQ chambers, which will record how much a mouse eats or drinks down to 0.01g. While the chambers are nice, the software that comes with them is terrible. In addition to being slow, it outputs data for each cage separately, which means you get to enjoy combining Excel files. After fighting the software, I decided I could do better, and made a small GUI.

Screenshot of the BioDAQ software. Note the Windows-95 era aesthetic. You can record up to 32 scales at a time. To get a single scale, you have to unclick the other 31! Data for each cage is saved individually.


The goal of the GUI was to be able to analyze many cages over multiple days, and output a single file containing the data for further analysis.

The GUI starts by loading a .tab file which contains the feeding data. The code for this was actually easy, as the data is just a tab-delimited text file. (It takes a little longer than I expected, a few seconds, due to datetime parsing.)
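Loading the .tab file really is just a tab-delimited read plus datetime parsing. The column names and date format below are guesses for illustration, not the actual BioDAQ schema:

```python
import csv
import io
from datetime import datetime

def load_biodaq(text, fmt="%m/%d/%Y %H:%M:%S"):
    """Parse tab-delimited BioDAQ-style text; column names are hypothetical."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["Time"] = datetime.strptime(row["Time"], fmt)  # the slow part
        row["Amount"] = float(row["Amount"])
        rows.append(row)
    return rows

sample = "Cage\tTime\tAmount\n1\t03/31/2016 10:00:00\t0.25\n"
data = load_biodaq(sample)
```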

Once the data is loaded, the GUI asks the users for information about which cages to analyze, which dates, times, and how to bin the data by time. If you are interested in the data from 10 cages, binned at one hour increments, over 5 days, you can simply input those numbers. Once everything is set, you can then save the data to a .csv, which will have the same base filename as the input data. The .csv will contain columns for:

date and time
cage id
number of feeding bouts
average bout duration (in seconds)
total eaten (in grams)
number of meals
average meal duration (in seconds)
average meal size (in grams)
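Binning by time boils down to truncating each timestamp to its bin and aggregating. A sketch for one-hour bins over (timestamp, grams) events:

```python
from collections import defaultdict
from datetime import datetime

def bin_by_hour(events):
    """Sum grams eaten per one-hour bin; events are (datetime, grams) pairs."""
    bins = defaultdict(float)
    for t, grams in events:
        # truncate the timestamp to the top of its hour
        bins[t.replace(minute=0, second=0, microsecond=0)] += grams
    return dict(bins)

events = [(datetime(2016, 3, 31, 10, 5), 0.12),
          (datetime(2016, 3, 31, 10, 40), 0.08),
          (datetime(2016, 3, 31, 11, 15), 0.30)]
totals = bin_by_hour(events)
```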

Left: The GUI. You can choose a range of cages, dates, times, and more. Right: Output CSV for the file given the parameters on the left. There is information for total food eaten, number of bouts of eating, number of meals, and duration for each of those. If a cage or time does not have information, the row will be blank.

If you use BioDAQs to measure feeding, and are similarly frustrated with the software, you can give this GUI a try! You only need to download two python files. To run the GUI, you can use either python 2.7 or 3+ (the Anaconda install should have all the relevant modules). Just open a command prompt, and type:


I suggest comparing the output of the GUI to some pre-analyzed data, so you can verify that it works. If you find this helpful, let me know!

Monday, March 28, 2016

A simple GUI for analyzing thermal images

One of the grad students in the lab has started a project on thermoregulation, and he measures mouse tail temperature using an infrared camera from FLIR. FLIR has an analysis tool for its cameras which works OK, but is not really designed for analyzing hundreds of images. To save him some time, I made a simple GUI for analyzing thermal images. In this post I'm going to outline the design of the gui, and how to use it, in case anyone else needs to analyze lots of thermal images.

Creating temperature images

The FLIR camera stores images in pseudocolor jpegs that look like this:

However, these jpegs do not contain the actual temperature data. To figure out where the temperature data was, I consulted this thread from two years ago about FLIR images. I learned that the data is actually contained in the EXIF of the jpeg, is only 80x60 pixels (compared to the jpeg's 320x240), contains intensity data (not temperature), and is stored with the wrong endianness. Luckily, the thread contained enough detail that I was able to figure out how to extract the image from the EXIF using exiftool, and to switch the endianness using ImageMagick.
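The endian swap can also be done in python with just the stdlib, as an alternative to ImageMagick, assuming the extracted EXIF payload is big-endian 16-bit pixel data:

```python
import struct

def big_to_native_u16(raw_bytes):
    """Decode big-endian 16-bit unsigned ints (the byte order described in the
    forum thread) into native python ints."""
    n = len(raw_bytes) // 2
    return list(struct.unpack(">%dH" % n, raw_bytes))

pixels = big_to_native_u16(b"\x01\x00\x00\x02")
```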

Once I had the imaging data, I then needed to convert it to temperature. Here the thread came in handy again, specifically this post which outlined how to convert radiance to temperature. All of the constants for the equation are also stored in the EXIF of the jpeg, which allowed me to calculate the temperature for each point.
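A hedged sketch of that radiance-to-temperature conversion, as I understand it from the forum post: the raw intensity is mapped to kelvin through a Planck-curve formula whose constants (R1, R2, B, O, F) are read from the jpeg's EXIF. The constant values below are plausible placeholders, not from a real camera:

```python
import math

def raw_to_celsius(raw, R1, R2, B, O, F):
    """Convert a raw sensor value to Celsius using EXIF Planck constants.
    Sketch of the forum-thread formula; constants vary per camera."""
    kelvin = B / math.log(R1 / (R2 * (raw + O)) + F)
    return kelvin - 273.15

# Placeholder constants for illustration only
t = raw_to_celsius(15000, R1=21000, R2=0.012, B=1428, O=-342, F=1)
```

Higher raw intensity should always map to a higher temperature, which is a useful sanity check on any set of constants.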


Once I was able to calculate true-temperature images, it was time to make the GUI! Previously I've made GUIs using Qt Designer, but I found an outline of a tkinter script that records the x,y coordinates whenever someone clicks on an image, so I decided to modify that instead. For the image to click on, I went with a grayscale version of the pseudocolor jpeg, as it looks a lot nicer.
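The one wrinkle in the click handler is mapping a click on the 320x240 display jpeg down to the matching pixel in the 80x60 temperature array. A sketch (the function name is mine):

```python
def display_to_thermal(x, y, disp=(320, 240), therm=(80, 60)):
    """Map a click on the display image to the matching temperature-array pixel."""
    return (x * therm[0] // disp[0], y * therm[1] // disp[1])

px = display_to_thermal(160, 120)  # click at the center of the display image
```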

To use the GUI, you need to install python 3.4, exiftool and ImageMagick. Then to run the gui, open a command prompt, and go to the directory with the images, then execute:


Grayscale version you can click on! The GUI uses the .jpg for display, but loads the temperature data in the background.

Once it opens, simply click on the pixel you want the temperature of, and the GUI will output the temperature of that pixel on the command line. If you are analyzing a bunch of images, you can hit space to go to the next image, or 'q' to quit. When you are done, the GUI will save a .csv containing the names of each image, and the temperature for that image.

Exciting screenshot of a CSV! Temperatures are in Celsius.
If this GUI sounds interesting to you, you can download the script. That folder contains: 1) a README.txt explaining how to install everything, and instructions on how to run the GUI; and 2) the script for the GUI. Everything was written in python 3. If you are using the script on Windows, you may have to install packages for tkinter and image. If you have any problems, please contact me, as I helped someone else in the lab set it up.