Contextualized Word Representations
How do we generate word vectors that depend on context, addressing the challenge of polysemy while also capturing a word's syntactic and semantic characteristics? This is the method published in “Deep Contextualized Word Representations” at NAACL 2018.
Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer
Problem at hand?
To create word representations that incorporate semantics according to the context in which a word is mentioned.
Chico Ruiz made a spectacular play on Alusik’s grounder {. . . }
- mentions ‘play’ in the context of a game/sport.
Olivia De Havilland signed to do a Broadway play for Garson {. . . }
- mentions ‘play’ in the context of theatre/acting.
Input and Output
Input
- an input text
Output
- a contextual word vector for each token in the text
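To make this input/output contract concrete, here is a minimal usage sketch assuming AllenNLP's PyTorch reimplementation of ELMo (not the paper's original code); the options/weights file paths are placeholders for the released pretrained biLM files.

```python
# Minimal sketch of the input/output contract, assuming AllenNLP's PyTorch
# reimplementation of ELMo (not the paper's original TensorFlow code).
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: point these at the released pretrained biLM files.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1 -> one task-specific weighted combination of layers
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

# Input: tokenized text
sentences = [
    ["Chico", "Ruiz", "made", "a", "spectacular", "play", "on", "Alusik's", "grounder"],
    ["Olivia", "De", "Havilland", "signed", "to", "do", "a", "Broadway", "play"],
]

# ELMo works on characters, so unseen words still get representations.
character_ids = batch_to_ids(sentences)

# Output: one contextual vector per token.
output = elmo(character_ids)
vectors = output["elmo_representations"][0]  # shape: (batch, max_tokens, 1024)
```

The two sentences are the ‘play’ examples from above; with a context-independent embedding both occurrences of ‘play’ would map to the same vector, whereas here they get different contextual vectors.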
What already exists?
- Two of the most popular word embedding models: GloVe and word2vec.
- Recently, a number of methods that attempt to capture context-dependent representations have emerged, notably context2vec, CoVe, and TagLM.
Why look for a new method?
- Existing models like word2vec and GloVe provide a single context-independent representation per word.
- Translation-based models such as CoVe, which transfer learned representations to other supervised tasks, are limited by the size of available parallel corpora.
- Other models like TagLM and context2vec use representations only from the final layer of the model, which leaves out information captured by the lower layers.
A quick overview
The paper uses a bidirectional language model (biLM), inspired by the TagLM paper, to exploit abundant monolingual corpora and generate word representations that can then be transferred to supervised tasks.
What do they propose?
In contrast to previous methods, the authors use a linear combination of the representations learned at all biLM layers instead of just the final-layer representation.
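Concretely, for token k the biLM with L layers produces a set of representations R_k = {h_{k,j}^{LM} | j = 0, …, L} (with h_{k,0}^{LM} the context-independent token embedding), and the paper collapses them into a single vector as:

```latex
\mathrm{ELMo}_k^{task}
  = E\!\left(R_k;\, \Theta^{task}\right)
  = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```

Here the s^{task} are softmax-normalized layer weights and γ^{task} is a scalar that scales the whole ELMo vector; both are learned jointly with the downstream task.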
Let’s dive into the crux!
Let’s see the architecture:
- Token Representation: A context-independent token representation is computed with character convolutions.
- Bidirectional Language Model (biLM): A language model conventionally predicts the next token in a sequence given the past tokens. In the bidirectional model, tokens are predicted from the forward direction (past context) as well as the backward direction (future context).
- ELMo: Embeddings from Language Models is the part where this work deviates from previous work. Instead of taking only the final-layer representation from the biLM, ELMo is a task-specific combination of the intermediate layer representations.
- ELMo for supervised tasks: The ELMo representations are used in supervised tasks by freezing the biLM weights and concatenating the task's token embeddings with the task-specific ELMo embeddings to create an ELMo-enhanced representation (a sketch follows this list).
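Below is a minimal PyTorch sketch of that layer mixing and the ELMo-enhanced input. The class and variable names are illustrative stand-ins, not the authors' code, and the frozen biLM activations are faked with random tensors.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Illustrative re-implementation of the ELMo layer combination:
    a softmax-weighted sum over all biLM layers, scaled by a learned gamma."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s^task (pre-softmax)
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^task

    def forward(self, layer_activations: torch.Tensor) -> torch.Tensor:
        # layer_activations: (num_layers, batch, seq_len, dim) from the frozen biLM
        s = torch.softmax(self.scalar_weights, dim=0)
        mixed = (s.view(-1, 1, 1, 1) * layer_activations).sum(dim=0)
        return self.gamma * mixed

# Toy usage: 3 layers (token embedding + 2 biLSTM layers), batch of 2, 9 tokens, dim 1024.
layer_activations = torch.randn(3, 2, 9, 1024)        # stand-in for frozen biLM outputs
elmo_k = ScalarMix(num_layers=3)(layer_activations)   # (2, 9, 1024)

# ELMo-enhanced representation: concatenate the task's own context-independent
# token embeddings x_k (random here, for illustration) with the ELMo vectors.
x_k = torch.randn(2, 9, 300)
enhanced = torch.cat([x_k, elmo_k], dim=-1)           # (2, 9, 300 + 1024)
```

Since the biLM is frozen, only the scalar weights, gamma, and the task model itself are updated during supervised training.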
Objective Function:
For training the language model, the objective is to jointly maximize the log-likelihood of the forward and backward directions. Perplexity is the common metric used to judge the quality of a trained LM.
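For a sequence of N tokens (t_1, …, t_N), the paper ties the token-representation parameters Θ_x and the softmax parameters Θ_s across the two directions and maximizes:

```latex
\sum_{k=1}^{N} \Big(
    \log p\!\left(t_k \mid t_1, \ldots, t_{k-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\right)
  + \log p\!\left(t_k \mid t_{k+1}, \ldots, t_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\right)
\Big)
```

Perplexity is simply the exponential of the average per-token negative log-likelihood, so lower perplexity indicates a better-fit language model.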
For the supervised tasks, task-specific objective functions are employed.
Which datasets do they use?
- The language model is trained on approximately 30 million sentences (the 1B Word Benchmark).
- The supervised tasks are trained on their respective standard datasets:
- SQuAD: Stanford Question Answering Dataset (100k+ QA pairs)
- SNLI: Textual entailment on the Stanford Natural Language Inference corpus (550k hypothesis-premise pairs)
- SRL: Semantic Role Labelling on the OntoNotes dataset
- Coref: Coreference Resolution on the OntoNotes dataset from the CoNLL 2012 shared task
- NER: Named Entity Recognition (PER, LOC, ORG, MISC) on the CoNLL 2003 Reuters newswire corpus
- SST-5: Sentiment Classification on the five-point-scale Stanford Sentiment Treebank corpus
What numbers do they improve on?
- SQuAD, SRL, Coref, NER: F1
- SNLI, SST-5: Accuracy
- The authors also evaluated the biLM representations on the downstream tasks of Word Sense Disambiguation (WSD) and Part-of-Speech (POS) tagging, achieving scores of 69.0 and 97.3 respectively, which shows that ELMo representations capture semantic as well as syntactic information.
- In more detailed experiments, the authors show that supervised tasks need significantly fewer epochs and less training data to reach the same accuracy/F1 as models trained without ELMo representations.
* All figures and equations have either been taken directly from the paper or been adapted as per my understanding.