I use word embeddings regularly at work, specifically word2vec. Word2vec creates vector representations of words by analyzing the contexts they appear in. These representations are useful because they let you calculate the similarity between words by measuring the distance between their vectors. The word2vec model has many hyperparameters that can affect its performance. Today's paper, "Redefining Context Windows for Word Embedding Models: An Experimental Study," by Lison and Kutuzov, is essentially a grid search over some of these hyperparameters, using news articles (Gigaword) and movie/TV subtitles (OpenSubtitles) as corpora. For evaluation, they used a measure of semantic and lexical similarity (SimLex-999) and a measure of analogy performance.
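As a concrete illustration of that distance calculation, here is a minimal cosine-similarity sketch. The toy 3-dimensional vectors are made up for illustration; real word2vec vectors typically have 100–300 dimensions.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity: 1.0 means same direction, ~0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" -- invented values, not real word2vec output
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, dog))  # high: words appear in similar contexts
print(cosine_similarity(cat, car))  # low: words appear in different contexts
```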
Here is what they found for each hyperparameter:
Context window: The "context" for a word is the surrounding words. This hyperparameter determines how many words are in the context. E.g., if the window is 1, then only words adjacent to a target word are considered; for a window of 5, the five words before and after a target word are in context.
In the past, it has been found that narrow windows improve the functional (part of speech) and synonymic qualities of an embedding; and wide windows improve topicality or analogies. This makes intuitive sense. For example, consider a narrow context like "how ____ Excel." The words that make sense in this context are mostly verbs, but the verbs could have widely different meanings. In contrast, consider a wider context like "how do I use the ____ in Excel to change date formats." Here, the missing word could be a verb or feature, but the topic is likely related to datetimes in Excel.
This paper performed a grid search over context windows of 1, 2, 5, and 10 words. Their results mirror the conventional wisdom: as window size increased, semantic similarity decreased, and analogy performance increased.
|A. Performance on SimLex goes down as window size increases (Gigaword corpus).|
|B. Performance on analogies increases with window size (subtitles corpus).|
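To make the window mechanics concrete, here is a sketch of how (target, context) pairs might be extracted for a symmetric window; `context_pairs` is a hypothetical helper for illustration, not code from the paper or from word2vec itself:

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for a symmetric window of the given size."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # window is clipped at the edges
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sent = "how do I use Excel".split()
# With window=1, "I" only sees its immediate neighbors "do" and "use".
print(context_pairs(sent, 1))
```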
In standard word embeddings, the context window is symmetric around a target word, e.g. five words before and after it. Lison and Kutuzov tried asymmetric windows, extending only to the right or only to the left of a target word. In all hyperparameter combinations and corpora, the left-only window performed worse than the symmetric one; however, in some combinations, the right-only window was as good as the symmetric one. This is interesting from a linguistic perspective, but given that symmetric windows work best, I'm not sure this is an actionable insight. It does make me wonder how asymmetric windows would fare in languages with different word order.
|In the subtitles corpus using the functional metric, right and symmetric windows performed similarly, while left performed worse.|
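A one-sided window is just the symmetric extraction with separate left and right sizes; a hypothetical sketch:

```python
def asymmetric_pairs(tokens, left, right):
    """(target, context) pairs with independent left/right window sizes."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - left)
        hi = min(len(tokens), i + right + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sent = "the cat sat down".split()
# Right-only window: each target is paired only with words that follow it.
print(asymmetric_pairs(sent, left=0, right=2))
```

A symmetric window of 5 is then simply `left=5, right=5`.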
Cross-sentential contexts: Cross-sentential is a fancy term for letting contexts cross sentence boundaries. To do this, you can feed an entire document through the model rather than chopping it into sentences beforehand. The Gigaword corpus had longer sentences than the subtitles corpus (21 words vs. 7). In the Gigaword corpus there was almost no difference between split-sentential and cross-sentential embeddings (perhaps due to the longer sentences). In the subtitles corpus, however, cross-sentential embeddings decreased functional scores and increased analogy performance. This was especially pronounced for wide windows (where the window often extended beyond a single sentence).
I think the takeaway here is that cross-sentential embeddings can be useful for specific goals: if you are more concerned with topicality and have short sentences, they can improve performance.
|Performance of embeddings using subtitle corpus.|
|A. Cross-sentential embeddings reduce functional performance.|
|B. Cross-sentential embeddings increase topic performance.|
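The split- vs. cross-sentential distinction is just a preprocessing choice: feed the corpus to the model sentence by sentence, or as one concatenated token stream. A minimal sketch with hypothetical helpers:

```python
from itertools import chain

def pairs(tokens, window):
    """Symmetric-window (target, context) pairs within one token sequence."""
    out = []
    for i, t in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                out.append((t, tokens[j]))
    return out

def split_sentential(sentences, window):
    """Contexts never cross a sentence boundary."""
    return [p for s in sentences for p in pairs(s, window)]

def cross_sentential(sentences, window):
    """Concatenate the document first, so contexts span boundaries."""
    return pairs(list(chain.from_iterable(sentences)), window)

subs = [["hello", "there"], ["who", "are", "you"]]
# Only the cross-sentential version pairs "there" with "who".
print(("there", "who") in split_sentential(subs, 2))  # False
print(("there", "who") in cross_sentential(subs, 2))  # True
```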
Distance weighting: Most embedding systems weight nearby words more heavily than distant ones, either linearly (word2vec) or harmonically (GloVe). The authors compared linear and square-root weightings and found no difference.
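The decay schemes can be written out directly. Note the square-root variant below is an assumption (the square root of the linear decay); the paper's exact formulation may differ:

```python
def linear_weight(d, window):
    """word2vec-style decay: distance d gets weight (window - d + 1) / window."""
    return (window - d + 1) / window

def harmonic_weight(d):
    """GloVe-style decay: weight is the reciprocal of the distance."""
    return 1.0 / d

def sqrt_weight(d, window):
    """Assumed square-root variant: square root of the linear decay."""
    return linear_weight(d, window) ** 0.5

window = 5
for d in range(1, window + 1):
    print(d, linear_weight(d, window), round(harmonic_weight(d), 2))
```

In all three schemes the adjacent word (d = 1) counts fully and weights fall off with distance; only the shape of the decay differs.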
Stop word removal:
Removing stop words improved analogy performance in both corpora without reducing semantic performance. This standard preprocessing step seems worthwhile.
| Corpus | Stop words | SimLex-999 | Analogies |
|--------|------------|-----------|-----------|
| OS (subtitles) | no removal | 0.41 | 0.34 |
| OS (subtitles) | with removal | 0.42 | 0.43 |
| GW (Gigaword) | no removal | 0.44 | 0.64 |
| GW (Gigaword) | with removal | 0.44 | 0.68 |
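Stop-word removal itself is a one-line filter. The stop list below is a tiny hand-picked illustration; real pipelines use fuller lists (e.g. NLTK's) or frequency-based filtering:

```python
# Hand-picked stop list for illustration only
STOP_WORDS = {"the", "a", "an", "in", "to", "of", "do", "i"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Content words once separated by function words now land in each other's window.
print(remove_stop_words("how do I use the pivot table in Excel".split()))
# → ['how', 'use', 'pivot', 'table', 'Excel']
```

This also hints at why removal helps analogies: with the filler gone, a window of the same size spans more content words.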
I think their grid search yielded four insights:
- There is a tradeoff in context window width between functional and topical similarity
- For short-sentence corpora, cross-sentential embeddings introduce a similar tradeoff, sacrificing functional similarity for topical similarity
- Stop word removal generally helps embeddings
- In English, the right context is more important than the left context