
Friday, February 16, 2018

Paper trail day trip: Compositional Coulee

The distributional hypothesis states that the meaning of a word is related to the words that usually surround it. For example, "table" usually appears near words like "sit," "chair," and "wood," so we can get a sense of what "table" means. For phrases, you can usually figure out the meaning by combining the meanings of the constituent words. For example, "table tennis" has something to do with hitting a ball on a flat surface. The idea that the meaning of a phrase comes from combining the meanings of its constituent words is called "compositionality." (In vector space, composing a phrase means something like adding or max-pooling the embeddings of its constituent words. People have developed more complicated composition functions that consider part-of-speech and more.)
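To make that vector-space aside concrete, here's a minimal sketch of additive (and max-pooling) composition; the vectors are made up, and real ones would come from a trained model like word2vec:

```python
import numpy as np

# Toy 3-dimensional word vectors (made-up numbers); real ones would come
# from a trained model such as word2vec or GloVe.
word_vecs = {
    "table": np.array([0.9, 0.1, 0.3]),
    "tennis": np.array([0.2, 0.8, 0.4]),
}

def compose_additive(words, vecs):
    """Compositional phrase embedding: element-wise mean of the word vectors."""
    return np.mean([vecs[w] for w in words], axis=0)

def compose_max(words, vecs):
    """Alternative composition: element-wise max over the word vectors."""
    return np.max([vecs[w] for w in words], axis=0)

print(compose_additive(["table", "tennis"], word_vecs))  # [0.55 0.45 0.35]
print(compose_max(["table", "tennis"], word_vecs))       # [0.9  0.8  0.4 ]
```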

However, compositionality has exceptions: "non-compositional" phrases. In the context of Microsoft, a "pivot table" has nothing to do with rotating furniture; it is an Excel feature. Idioms are common examples of non-compositional phrases. Some phrases are polysemous (have multiple meanings) and can be both compositional and non-compositional: "drive home" can mean going to my house, or reiterating a point.

So how do we try to understand the meaning of phrases, knowing they might be non-compositional? In the past, many people have ignored this dichotomy by either treating all phrases as compositional (just aggregating the meanings of the individual words) or treating all phrases as non-compositional (treating each phrase as a single token). If your text corpus were infinite, the non-compositional method would be the most accurate; in the real world, where phrases are sparse, it creates a huge data sparsity problem. What would be ideal is a way to identify whether a phrase is compositional, and then treat it as such.
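(As an aside, "treating each phrase as a single token" usually just means a preprocessing pass like the sketch below; the phrase list and example sentence are hypothetical.)

```python
def merge_phrases(tokens, phrases):
    """Rewrite known two-word phrases as single underscore-joined tokens so a
    word2vec-style trainer sees each phrase as one atomic unit."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

phrases = {("pivot", "table")}  # hypothetical phrase inventory
print(merge_phrases("click the pivot table button".split(), phrases))
# ['click', 'the', 'pivot_table', 'button']
```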

Luckily, Hashimoto and Tsuruoka have done just that! Their basic approach is to train a word embedding that has both a compositional and non-compositional model for every phrase, and figure out how to blend the two. (For an overview of word embeddings, see my previous post.)

Representing phrases as a combination of meanings


In more detail, they trained two embeddings for a phrase p, c(p) and n(p), which are the compositional and non-compositional embeddings respectively. Then they combined these two representations using a weighting function, a(p), which ranges from 0 to 1 (from non-compositional to fully compositional). The final embedding for a phrase is then:

v(p) = a(p) * c(p) + (1-a(p)) * n(p)

a(p) is parameterized as a logistic regression for phrase p. When they trained their embedding, they performed a joint optimization for both v(p) and a(p). This means that for every phrase they got its vector representation v(p) as well as a measure of its compositionality, a(p). They trained on the British National Corpus, and English Wikipedia.
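Here's a rough numpy sketch of that blend, with a(p) written as a sigmoid of a per-phrase scalar; the vectors and the logit are made up, and the real model learns all of these jointly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phrase_embedding(c_p, n_p, logit_p):
    """Blend the compositional embedding c(p) and the non-compositional
    embedding n(p) with a per-phrase weight a(p) = sigmoid(logit_p) in (0, 1).
    In the paper, the embeddings and a(p) are all learned jointly."""
    a_p = sigmoid(logit_p)
    return a_p * c_p + (1.0 - a_p) * n_p

# Made-up vectors for a phrase the model judges mostly non-compositional
# (a(p) = sigmoid(-1.4), roughly 0.2).
c_p = np.array([0.7, 0.1, 0.2])   # built from the constituent word vectors
n_p = np.array([-0.3, 0.9, 0.5])  # learned directly for the phrase token
print(phrase_embedding(c_p, n_p, logit_p=-1.4))
```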

For select phrases, they visualized how the compositionality weight changed over epochs of training.  They initialized all phrases' a(p) = 0.5; over time these weights diverged towards mostly compositional or non-compositional:
Compositionality scores change over the course of training. Phrases like "buy car" are judged to be compositional while "shed light" is not. Only one phrase is in the middle, "make noise."
They also provided some examples of phrase similarity at different compositionality weights a(p), which illustrate how treating non-compositional phrases as compositional can yield strange results.

Similar phrases for selected phrases using compositional (a(p) = 1), mixed (a(p) = 0.5), or non-compositional (a(p) = 0.1-0.14) embeddings.


Evaluation


After getting their embedding and compositionality weights for phrases, they evaluated their embedding on verb-object pairs that had been scored by humans for their compositionality (I wonder how many of the scorers were NLP grad students). For example, "buy car" was given a high compositionality score of 6, while "bear fruit" was given a score of 1. They then compared their embedding's compositionality weights to the human scores using Spearman correlation, and found a correlation of 0.5, compared to the human-human correlation of 0.7. Pretty good!
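The comparison itself boils down to a rank correlation between the learned a(p) weights and the human ratings; a toy version with scipy (made-up numbers) looks like:

```python
from scipy.stats import spearmanr

# Made-up numbers for a handful of verb-object phrases:
human_scores = [6.0, 5.5, 2.0, 1.0]      # e.g. "buy car" high, "bear fruit" low
model_alphas = [0.92, 0.80, 0.35, 0.15]  # the model's learned a(p) weights

rho, p_value = spearmanr(human_scores, model_alphas)
print(f"Spearman rho = {rho:.2f}")
```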

As a second evaluation, they used a subject-verb-object (SVO) dataset that was rated for how similar pairs of SVOs were, in the context of polysemous verbs. For example, "run" and "operate" are similar in the phrases "people run companies" and "people operate companies," but "people move companies" is not similar to either.  Again they used Spearman correlation, and this time got 0.6 compared to 0.75.

Conclusion

I feel like the conclusion of this paper is a "no duh": if a phrase is non-compositional, treat it as a unique entity; if a phrase is compositional, just use the meaning of its individual words. The joint optimization was pretty cool, although I wonder about its speed. I like that word2vec can be trained in ten minutes on a few million documents.

I also wonder whether phrases can be cleanly separated into compositional and non-compositional phrases, which would make things much simpler. To know that, we would have to know the distribution of compositionality scores, which I wish they had shown (computer science papers are generally mediocre when it comes to presenting data). If compositionality is bimodally distributed, it would be fairly easy to estimate it in normal word2vec space: just calculate the cosine similarity between the phrase's embedding and the mean of its constituent words' embeddings. Then, if they are similar, you could break up the phrase and retrain the model.
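Here's a sketch of that check, assuming phrases were trained as single underscore-joined tokens alongside their constituent words (the vectors are placeholders):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compositionality_score(phrase_token, words, vecs):
    """Cosine similarity between a phrase's own embedding (trained as a single
    token) and the mean of its constituent words' embeddings. High similarity
    suggests the phrase is compositional."""
    word_mean = np.mean([vecs[w] for w in words], axis=0)
    return cosine(vecs[phrase_token], word_mean)

vecs = {  # placeholder vectors; real ones would come from the trained model
    "pivot_table": np.array([0.1, 0.9, 0.2]),
    "pivot": np.array([0.8, 0.1, 0.4]),
    "table": np.array([0.7, 0.2, 0.5]),
}
print(compositionality_score("pivot_table", ["pivot", "table"], vecs))
```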

(Postscript: I did this on a corpus of Microsoft support text. The distribution of cosine similarity between a phrase and the mean of its constituent words is normal-ish, which makes separating compositional and non-compositional phrases less clean. Maybe a good heuristic for whether to treat a phrase as a token is: keep it if it's in the bottom 25% for cosine similarity, or keep it if it's common enough to get a good embedding.)
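A sketch of that heuristic, with an arbitrary frequency cutoff and made-up numbers:

```python
import numpy as np

def keep_as_token(scores, counts, phrase, min_count=100):
    """Keep a phrase as a single token if its phrase-vs-constituents cosine
    similarity is in the bottom 25% (likely non-compositional), or if it is
    frequent enough to get a good embedding anyway. min_count is arbitrary."""
    cutoff = np.percentile(list(scores.values()), 25)
    return scores[phrase] <= cutoff or counts[phrase] >= min_count

# scores: {phrase: cosine similarity}; counts: {phrase: corpus frequency}
scores = {"pivot_table": 0.15, "buy_car": 0.85, "shed_light": 0.18, "make_noise": 0.55}
counts = {"pivot_table": 5000, "buy_car": 40, "shed_light": 200, "make_noise": 30}
print({p: keep_as_token(scores, counts, p) for p in scores})
```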