Contextualized Word Representations
How do we generate word vectors that depend on context, addressing the challenge of polysemy while also capturing a word's syntactic and semantic characteristics? This is the method published in “Deep Contextualized Word Representations” at NAACL 2018.
Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer
Problem at hand?
To create word representations that incorporate semantics according to the context in which a word is mentioned.
Chico Ruiz made a spectacular play on Alusik’s grounder {. . . }
- mentions ‘play’ in the context of a game/sport.
Olivia De Havilland signed to do a Broadway play for Garson {. . . }
- mentions ‘play’ in the context of theatre/acting.
Input and Output
Input
- an input text
Output
- a contextual word vector for each token in the text
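To make this input/output contract concrete, here is a minimal usage sketch assuming AllenNLP's PyTorch reimplementation of ELMo (not the paper's original code); the options/weights file paths are placeholders for the released pretrained biLM files.

```python
# Minimal sketch of the input/output contract, assuming AllenNLP's PyTorch
# reimplementation of ELMo (not the paper's original TensorFlow code).
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: point these at the released pretrained biLM files.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1 -> one task-specific weighted combination of layers
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

# Input: tokenized text
sentences = [
    ["Chico", "Ruiz", "made", "a", "spectacular", "play", "on", "Alusik's", "grounder"],
    ["Olivia", "De", "Havilland", "signed", "to", "do", "a", "Broadway", "play"],
]

# ELMo works on characters, so unseen words still get representations.
character_ids = batch_to_ids(sentences)

# Output: one contextual vector per token.
output = elmo(character_ids)
vectors = output["elmo_representations"][0]  # shape: (batch, max_tokens, 1024)
```

The two sentences are the ‘play’ examples from above; with a context-independent embedding both occurrences of ‘play’ would map to the same vector, whereas here they get different contextual vectors.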
What already exists?
- Two of the most popular word embedding models: GloVe and word2vec.
- Recently, a number of methods that attempt to capture context-dependent representations have emerged, notably context2vec, CoVe, and TagLM.
Why look for a new method?
- Existing models like word2vec and GloVe provide a single context-independent representation per word.
- Translation-based models such as CoVe, which transfer learned representations to other supervised tasks, are limited by the size of available parallel corpora.
- Other models like TagLM and context2vec use representations only from the final layer of the model, which leaves out information captured by the lower layers.
A quick overview
The paper uses a bidirectional language model (biLM), inspired by the TagLM paper, to exploit abundant monolingual corpora and generate word representations that can then be transferred to supervised tasks.
What do they propose?
In contrast to previous methods, the authors use a linear combination of the representations learned at all biLM layers instead of just the final-layer representation.
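Concretely, for token k the biLM with L layers produces a set of representations R_k = {h_{k,j}^{LM} | j = 0, …, L} (with h_{k,0}^{LM} the context-independent token embedding), and the paper collapses them into a single vector as:

```latex
\mathrm{ELMo}_k^{task}
  = E\!\left(R_k;\, \Theta^{task}\right)
  = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```

Here the s^{task} are softmax-normalized layer weights and γ^{task} is a scalar that scales the whole ELMo vector; both are learned jointly with the downstream task.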
Let’s dive into the crux!
Let’s see the architecture:
- Token Representation: A context-independent token representation is computed with character convolutions.
- Bidirectional Language Model (biLM): A language model conventionally predicts the next token in a sequence given the past tokens. In the bidirectional model, tokens are predicted from the forward direction (past context) as well as the backward direction (future context).
- ELMo: Embeddings from Language Models is the part where this work deviates from previous work. Instead of taking only the final-layer representation from the biLM, ELMo is a task-specific combination of the intermediate layer representations.
- ELMo for supervised tasks: The ELMo representations are used in supervised tasks by freezing the biLM weights and concatenating the task's token embeddings with the task-specific ELMo embeddings to create an ELMo-enhanced representation (a sketch follows this list).
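Below is a minimal PyTorch sketch of that layer mixing and the ELMo-enhanced input. The class and variable names are illustrative stand-ins, not the authors' code, and the frozen biLM activations are faked with random tensors.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Illustrative re-implementation of the ELMo layer combination:
    a softmax-weighted sum over all biLM layers, scaled by a learned gamma."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s^task (pre-softmax)
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^task

    def forward(self, layer_activations: torch.Tensor) -> torch.Tensor:
        # layer_activations: (num_layers, batch, seq_len, dim) from the frozen biLM
        s = torch.softmax(self.scalar_weights, dim=0)
        mixed = (s.view(-1, 1, 1, 1) * layer_activations).sum(dim=0)
        return self.gamma * mixed

# Toy usage: 3 layers (token embedding + 2 biLSTM layers), batch of 2, 9 tokens, dim 1024.
layer_activations = torch.randn(3, 2, 9, 1024)        # stand-in for frozen biLM outputs
elmo_k = ScalarMix(num_layers=3)(layer_activations)   # (2, 9, 1024)

# ELMo-enhanced representation: concatenate the task's own context-independent
# token embeddings x_k (random here, for illustration) with the ELMo vectors.
x_k = torch.randn(2, 9, 300)
enhanced = torch.cat([x_k, elmo_k], dim=-1)           # (2, 9, 300 + 1024)
```

Since the biLM is frozen, only the scalar weights, gamma, and the task model itself are updated during supervised training.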
Objective Function:
For training the language model, the objective is to jointly maximize the log-likelihood of the forward and backward directions. Perplexity is the common metric used to judge the quality of a trained LM.
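For a sequence of N tokens (t_1, …, t_N), the paper ties the token-representation parameters Θ_x and the softmax parameters Θ_s across the two directions and maximizes:

```latex
\sum_{k=1}^{N} \Big(
    \log p\!\left(t_k \mid t_1, \ldots, t_{k-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\right)
  + \log p\!\left(t_k \mid t_{k+1}, \ldots, t_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\right)
\Big)
```

Perplexity is simply the exponential of the average per-token negative log-likelihood, so lower perplexity indicates a better-fit language model.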
For the supervised tasks, task-specific objective functions are employed.
Which datasets do they use?
- The language model is trained on approximately 30 million sentences (the 1B Word Benchmark).
- The supervised tasks are trained on their respective standard datasets:
- SQuAD: Stanford Question Answering Dataset (100k+ QA pairs)
- SNLI: Textual entailment on the Stanford Natural Language Inference corpus (550k hypothesis-premise pairs)
- SRL: Semantic Role Labelling on the OntoNotes dataset
- Coref: Coreference Resolution on the OntoNotes dataset from the CoNLL 2012 shared task
- NER: Named Entity Recognition (PER, LOC, ORG, MISC) on the CoNLL 2003 Reuters newswire corpus
- SST-5: Sentiment Classification on the five-point-scale Stanford Sentiment Treebank corpus
What numbers do they improve on?
- SQuAD, SRL, Coref, NER: F1
- SNLI, SST-5: Accuracy
- The authors also evaluated the biLM representations on the downstream tasks of Word Sense Disambiguation (WSD) and Part-of-Speech (POS) tagging, achieving scores of 69.0 and 97.3 respectively, which shows that ELMo representations capture semantic as well as syntactic information.
- In more detailed experiments, the authors show that supervised tasks need significantly fewer epochs and less training data to reach the same accuracy/F1 as models trained without ELMo representations.
* All figures and equations have either been taken directly from the paper or been adapted as per my understanding.