Thursday, December 28, 2017

Paper trail day trip: Lac Lison

(Blogger note: I am now a Data Scientist in Microsoft Support Engineering, working with natural language processing (NLP). I have been reading NLP papers for work, and rather than just post summaries to our Teams (Microsoft Slack) channel, I figured I could summarize them here for my future self, and others.)

I use word embeddings regularly at work, specifically word2vec. Word2vec creates vector representations of words, by analyzing the word contexts they appear in. These representations are useful since it lets you calculate similarity between words by calculating the distance between them. The Word2vec model has many hyperparameters that can effect its performance. Today's paper, "Redefining Context Window for Word Embedding Models: An Experimental Study," by Lison and Kutuzov is basically a grid search over some of these hyperparameters, using Google News articles, and teleplay scripts as corpuses. For evaluation, they used a measure of semantic and lexical similarity (SimLex999), and a measure of analogies.

Here is what they found for each hyperparameter:

Context window:

The "context" for a word is the surrounding words. This hyperparameter determines how many words are in the context. E.g. if the window is 1, then only words adjacent to a target word are considered; for a window of 5, the five words before and after a target word are in context.

In the past, it has been found that narrow windows improve the functional (part of speech) and synonymic qualities of an embedding; and wide windows improve topicality or analogies. This makes intuitive sense. For example, consider a narrow context like "how ____ Excel." The words that make sense in this context are mostly verbs, but the verbs could have widely different meanings. In contrast, consider a wider context like "how do I use the ____ in Excel to change date formats." Here, the missing word could be a verb or feature, but the topic is likely related to datetimes in Excel.

This paper performed a grid search over context windows of 1, 2, 5, and 10 words. Their results mirror the conventional wisdom: as window size increased, semantic similarity decreased, and analogy performance increased.

A. Performance on SimLex goes down as window size increases (Google news corpus).
B. Performance on analogies increases with window size (Subtitles corpus)

Window position:

In standard word embeddings, the context window is symmetric around a target word, e.g. five words before and after it. Lison and Kutuzov tried using asymmetric embeddings to right and left of a target word. In all hyperparameter combinations and corpora, the left window was worse than symmetric; however, in some combinations, the right window was equally as good as the symmetric one. This is interesting from a linguistic perspective, but given that symmetric embeddings work the best, I'm not sure this is an actionable insight. It does make me wonder how asymmetric windows would work in other languages with different word order.

In the subtitles corpus using the functional metric, right and symmetric windows performed similarly, while left performed worse.

Cross-sentential embeddings:

Cross-sentential is a fancy word for letting contexts cross sentence boundaries. To do this, you can put an entire document through the model rather than chopping it into sentences beforehand. The Google News corpus had longer sentences than the subtitles corpus (21 words to 7). In the Google News corpus there was almost no difference between split-sentential and cross-sentential embeddings (perhaps due to the longer sentence length). In the subtitles corpus, however, functional scores were decreased by using a cross-sentential embeddings, and analogy performance increased. This was especially pronounced for wide windows in the cross-sentential context (where the window was wider than a single sentence).

Performance of embeddings using subtitle corpus.
A. Cross-sentential embeddings reduce functional performance
B. Cross-sentential embeddings increase topic performance
I think the takeaway here is the cross-sentential embeddings can be useful for specific goals: if you are more concerned with topicality and have short documents, it can improve performance.

Window weighting:

Most embedding systems weight near words more than far words, either linearly (word2vec) or harmonically (GloVE). They compared linear and square root weightings, and found no difference.

Stop word removal:

Removing stop words improved analogy performance in both corpuses without reducing semantic performance. It seems this standard procedure is useful.

Stop words removal SimLex999 Analogies
OS no removal         0.41       0.34
OS with removal 0.42       0.43
GW no removal         0.44       0.64
GW with removal       0.44       0.68


I think their grid search yielded four insights:
  1. There is a tradeoff in context window width between functional and topical similarity
  2. For short sentence corpuses, there is also a tradeoff between functional and topical similarity by using cross-sentential embeddings
  3. Stop word removal generally helps embeddings
  4. In English, the right context is more important than the left context

Sunday, January 1, 2017

Using KDTrees in Apache Spark

OR: The best Coffee Shop in Hong Kong to catch Pokemon

I work with spatial data all the time, and one of the most common things I do with spatial data is find the nearest locations between two sets of objects. For example, in the context of Pokemon Go, you might ask, "what is the nearest Pokestop to a given Pokemon?" The standard way to do this is to use a data structure called a KDTree.

In the past six months, I have started using Apache Spark, and quickly grown to love it. However, I haven't found any good tutorials on how to use KDTrees in Spark. To fill the void, I have written a short tutorial on how to use scipy KDTrees in Spark. The tutorial covers how to load Pokemon location information from Hong Kong, why KDTrees are great, how to create a KDTree of coffee shops in Hong Kong, and the code to combine them using Spark. I wrote the tutorial as a Jupyter notebook, but haven't figured out how to embed those in Blogger, so head over to Github for a gander.

If you want a sneak preview, this is how I define the udf which does the query:
coffee_udf = F.udf( partial(query_kdtree, cur_tree = coffee_tree_broadcast),
                    T.ArrayType( T.IntegerType() ) )