Retrofitting Word Vectors to Semantic Lexicons

How can the information present in existing lexicons be integrated into word vectors? A look at the method published as Retrofitting Word Vectors to Semantic Lexicons at NAACL 2015.

Ayush Kumar
3 min read · Apr 22, 2018

Authors: Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, Noah A. Smith

Problem at hand?

To inject the information present in semantic lexicons into word vectors.

Say a lexicon L tells us that the words ‘good’, ‘brilliant’, ‘exceptional’, ‘great’ and ‘superb’ are connected to each other.

The task is to transfer this information to the vectorial representations (popularly known as word embeddings) such that the embeddings of these words become similar to each other.

Input and Output

Input

  • Pre-computed word vectors
  • Semantic Lexicon

Output

  • Refined (retrofitted) word vectors

What’s already existing?

  • Changing the objective function of the word vector training algorithm by adding a prior or regularizer (in neural language models).
  • Relation-specific augmentation of the co-occurrence matrix (in spectral word vector models).

Why to look for a new method?

  • Existing methods apply these improvements as part of the training process, which essentially requires end-to-end retraining to incorporate any new lexicon.
  • Aligning information from a semantic lexicon with a given set of word vectors is method-specific. Regularization? Relation augmentation? It depends on the word vector training algorithm.

A quick overview

The paper formulates a post-processing method that applies the idea of belief propagation over a relational graph constructed from the lexicon at hand.

What do they propose?

How about a setup in which the lexicon is represented as a graph, with edges denoting relations between nodes (words)? Each word then looks at its neighbours, collects information (their word embeddings) from them, and updates itself iteratively.
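As a toy illustration of that setup, the example lexicon above can be stored as a simple adjacency map (a hypothetical in-memory representation, not the paper's data format):

```python
# Toy relational graph for the running example: each word maps to the words
# it is related to (a fully connected clique of synonyms in this case).
words = ["good", "brilliant", "exceptional", "great", "superb"]
lexicon = {w: [n for n in words if n != w] for w in words}
```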

Let’s dive into the crux!

Word graph with edges between related words. Orange nodes are words with pre-computed vectors, while the white nodes are the ones to be inferred. Image taken from a talk by Manaal Faruqui¹.

Let’s see the architecture:

  • Relational graph: Generated from the lexicon. Edges represent a relation (synonymy in the simplest case) between words.
  • Pre-computed word embeddings: GloVe / Skip-gram / Global-context.
  • Values for the white nodes are updated iteratively.

Objective Function:

Minimize the Euclidean distance between a word's retrofitted vector and its observed (pre-computed) vector, and between the retrofitted vector and those of its neighbours.

Representative objective function. Alpha and beta represent the relative strengths of information coming from the original word vector and from the neighbours.
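For readers without the figure, the objective from the paper (with q̂ᵢ the original vector and qᵢ the retrofitted one, E the set of edges in the graph) can be written as:

```latex
\Psi(Q) \;=\; \sum_{i=1}^{n} \Big[ \alpha_i \,\lVert q_i - \hat{q}_i \rVert^2
\;+\; \sum_{(i,j) \in E} \beta_{ij} \,\lVert q_i - q_j \rVert^2 \Big]
```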

Note that the original word embeddings aren't updated. So, with alpha and beta held constant, solving the objective for Q gives the following update rule:

Iterative update for a node i
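Written out (again reconstructed from the paper), the update for node i is:

```latex
q_i \;=\; \frac{\sum_{j:(i,j) \in E} \beta_{ij}\, q_j \;+\; \alpha_i\, \hat{q}_i}
               {\sum_{j:(i,j) \in E} \beta_{ij} \;+\; \alpha_i}
```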

In all experiments, alpha is set to 1 and beta to the inverse of the degree of node i.
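To make the update concrete, here is a minimal sketch of the retrofitting loop under those settings (alpha = 1, beta = 1/degree). It assumes `vectors` and `lexicon` are plain dictionaries; it is an illustration, not the authors' reference implementation.

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10):
    """Retrofitting sketch: `vectors` is {word: np.ndarray},
    `lexicon` is {word: [related words]}."""
    q_hat = {w: v.copy() for w, v in vectors.items()}  # originals, kept fixed
    q = {w: v.copy() for w, v in vectors.items()}      # retrofitted copies
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            neighbours = [n for n in neighbours if n in q]
            if word not in q or not neighbours:
                continue
            # alpha = 1, beta_ij = 1 / degree(i), as in the paper's experiments
            alpha = 1.0
            beta = 1.0 / len(neighbours)
            # numerator: beta-weighted sum of neighbours + alpha-weighted original
            new_vec = beta * sum(q[n] for n in neighbours) + alpha * q_hat[word]
            # denominator: sum of betas + alpha
            q[word] = new_vec / (beta * len(neighbours) + alpha)
    return q
```

With beta set to the inverse degree, each update simply averages the original vector with the mean of the neighbours' current vectors.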

What does the model effectively output?

  • Refined (read: retrofitted) word vectors that (ideally) incorporate information from both the pre-existing semantic lexicon and the pre-computed word vectors.

What lexicons and word vectors do they use?

  • Lexicons: PPDB, WordNet, FrameNet.
  • Word Vectors: GloVe, Skip-gram, Global-context, Multi-lingual.

What numbers do they improve on?

  • Word Similarity: Spearman’s correlation between the similarity of the two word vectors in a pair and human judgements of similarity (a small evaluation sketch follows this list).
  • Syntactic Relations: Find the word d that completes the analogy ‘a : b :: c : d’. Measured by accuracy.
  • Synonym Selection: Pick the semantically closest word among four provided choices. Measured by accuracy.
  • Sentiment Analysis: Binary classification. Measured by accuracy.
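For intuition on the first metric, here is a rough sketch of the word-similarity protocol. The helper `word_similarity_score` and the `pairs_with_gold` triples are assumptions for illustration; benchmark datasets supply the actual word pairs and human scores.

```python
# Sketch of the word-similarity evaluation: rank cosine similarities of
# vector pairs against human judgements with Spearman's correlation.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(vectors, pairs_with_gold):
    # pairs_with_gold: iterable of (word1, word2, human_score) triples
    predicted, gold = [], []
    for w1, w2, score in pairs_with_gold:
        if w1 in vectors and w2 in vectors:
            predicted.append(cosine(vectors[w1], vectors[w2]))
            gold.append(score)
    return spearmanr(predicted, gold).correlation
```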

The method is also compared against existing work that jointly trains word vectors, incorporating information from the lexicons during the training process itself.

* All figures and equations have been taken from the paper or the talk¹.
  1. Manaal Faruqui's talk: https://youtu.be/yG4XbgytH4w
