Learned in translation: contextualized word vectors

By Bryan McCann

Natural language processing (NLP) has a great way of instilling new neural networks with an understanding of individual words, but the field has yet to find a way to initialize new networks with an understanding of how those words might relate to other words. Our work proposes to use networks that have already learned how to contextualize words to give new neural networks an advantage in learning to understand other parts of natural language.

For most problems in NLP, understanding context is essential. Translation models need to understand how the words in an English sentence work together in order to generate a German translation. Summarization models need context in order to know which words are most important. Models performing sentiment analysis need to understand how to pick up on key words that change the sentiment expressed by others. Question answering models rely on an understanding of how words in a question shift the importance of words in a document. As each of these models needs to understand how context influences a word's meaning, each can benefit from teaming up with a model that has already learned how to contextualize words.

A path towards the ImageNet-CNN of NLP

Computer vision has had more success finding reusable representations than NLP. Deep convolutional neural networks (CNNs) trained on a large image classification dataset, ImageNet, are frequently used as components in other models. In order to classify images well, CNNs learn representations of images by progressively building up a more complex understanding of how pixels relate to other pixels. Models tackling tasks like image captioning, facial recognition, and object detection can then start with these representations rather than from scratch. NLP should be able to do something similar with words and their context.

We teach a neural network how to understand words in context by first teaching it how to translate English to German. Then, we show that we can reuse this network in a way that mirrors the reuse of CNNs trained on ImageNet in computer vision. We do this by treating the network's outputs, which we call context vectors (CoVe), as inputs to new networks that learn other NLP tasks. In our experiments, providing CoVe to these new networks always improves their performance, so we are excited to release the trained network that generates CoVe in order to facilitate further exploration of reusable representations in NLP.

Word vectors

Most deep learning models for NLP today rely on word vectors to represent the meaning of individual words. For those unfamiliar with this idea, all this means is that we associate each word in the language with a list of numbers called a vector.

Figure 1: It is common in deep learning to represent words as vectors. Instead of reading sequences of words as text, deep learning models read sequences of word vectors.
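As a minimal sketch of this idea, the snippet below turns a tokenized sentence into the sequence of vectors a model would read. The three-dimensional vectors and the `embed` helper are invented for illustration; real pretrained vectors typically have hundreds of dimensions.

```python
# Toy embedding table: every value here is invented for illustration.
EMBEDDINGS = {
    "words": [0.2, -0.1, 0.7],
    "as": [0.0, 0.3, -0.4],
    "vectors": [0.9, 0.1, 0.5],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words missing from the table


def embed(tokens):
    """Replace each token with its vector, giving the sequence a model reads."""
    return [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]


sentence = "words as vectors".split()
vector_sequence = embed(sentence)
```

The model never sees the strings themselves, only `vector_sequence`, which has one vector per word in the original order.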

Pretrained word vectors

There are times when word vectors are initialized to lists of random numbers before a model is trained for a specific task, but it is also quite common to initialize the word vectors of a model with those obtained by running methods like word2vec, GloVe, or FastText. Each of those methods defines a way of learning word vectors with useful properties. The first two work off of the hypothesis that at least part of a word's meaning is tied to how it is used.

word2vec trains a model to take in a word and predict a local context window; the model sees a word and tries to predict the words around it.

Figure 2: Algorithms like word2vec and GloVe produce word vectors that are correlated with the vectors of the words that regularly occur around them in natural language. In this way, the vector for 'vectors' comes to reflect that the word 'vectors' appears around words like 'lists', 'of', and 'numbers'.
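To make the local context window concrete, here is a small sketch of how (word, context word) training pairs could be collected from a sentence. The `skipgram_pairs` helper and the window size are assumptions for illustration, and the training objective that word2vec applies to such pairs is omitted entirely.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "lists of numbers called vectors".split()
pairs = skipgram_pairs(tokens, window=1)
```

With `window=1`, each word is paired only with its immediate neighbors, so 'of' is paired with 'lists' and 'numbers' but not with 'called'.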

GloVe takes a similar approach, but it also explicitly adds statistics about how often each word occurs with each other word. In both cases, each word is represented by a corresponding word vector, and training forces the word vectors to correlate with each other in ways that are tied to the usage of the word in natural language.
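The co-occurrence statistics mentioned above can be sketched as a simple count of how often two words appear near each other. This is only an illustration under assumed settings: the `cooccurrence_counts` helper, the window size, and the unweighted counting are choices made here, and the step that fits word vectors to these counts is not shown.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [
    "we represent words as vectors".split(),
    "vectors are lists of numbers".split(),
    "words as lists of numbers".split(),
]
counts = cooccurrence_counts(corpus, window=1)
```

Pairs that are adjacent in several sentences, such as ('lists', 'of'), accumulate higher counts than pairs that are adjacent only once or never.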

The emergent properties of pretrained word vectors

Viewing these word vectors as points in space, we can see fascinating emergent relationships that are reminiscent of semantic relationships between words.

Figure 3: Differences between vectors capture male-female word pairs (Pennington et al 2014).

Figure 4: For relationship a-b, c:d means that c+(a-b) yields a vector closest to d (Mikolov et al 2013).

Figure 5: Differences between vectors capture comparative and superlative relationships (Pennington et al 2014).
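The vector arithmetic behind these figures can be sketched with a handful of made-up vectors. Everything below is hypothetical: the two-dimensional values are chosen by hand so that the offsets line up, and the `analogy` helper is an illustrative name, not part of word2vec or GloVe.

```python
import math

# Invented 2-d vectors: first axis loosely "royalty", second loosely "gender".
VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word whose vector is closest to c + (b - a)."""
    target = [vc + vb - va
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # Exclude the three query words so the answer is a new word.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


answer = analogy("man", "woman", "king")
```

Adding the man-to-woman offset to 'king' lands on the toy vector for 'queen', which mirrors the male-female pattern shown in Figure 3.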

It quickly came to light that initializing a model for a target task with word vectors pretrained for intermediate tasks defined by word2vec or GloVe would give the model an advantage on the target task. Word vectors produced by word2vec and GloVe thus found their way into widespread experiments across the many tasks in NLP.

Hidden vectors

These pretrained word vectors exhibit interesting properties and provide a performance gain over randomly initialized word vectors, but, as previously mentioned, words rarely appear in isolation. Models that use pretrained word vectors must still learn how to use them in context. Our work picks up where word vectors left off: instead of starting from randomly initialized methods for contextualizing word vectors, we look to improve on them by training on an intermediate task.


A common approach to contextualizing word vectors is to use a recurrent neural network (RNN). RNNs are deep learning models that process vector sequences of variable length. This makes them suitable for processing sequences of word vectors. We use a specific kind of RNN called Long Short-Term Memory (LSTM) to better handle long sequences. At each step in processing, the LSTM takes in a word vector and outputs a new vector called the hidden vector. This process is often referred to as encoding the sequence, and the neural network that does the encoding is referred to as an encoder.

Figure 6: An LSTM encoder takes in a sequence of word vectors and outputs a sequence of hidden vectors.
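The encoding loop can be sketched in plain Python. This is an illustration only: the `TinyLSTM` class is hypothetical, its weights are random rather than trained, and the sizes are far smaller than anything used in practice. It follows the standard LSTM update with input, forget, and output gates plus a candidate cell value.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """A single-layer LSTM with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # Each gate has one weight row per hidden unit, applied to
        # [x; h_prev; 1], where the trailing 1 plays the role of a bias.
        width = input_size + hidden_size + 1
        self.weights = {
            gate: [[rng.uniform(-0.5, 0.5) for _ in range(width)]
                   for _ in range(hidden_size)]
            for gate in ("input", "forget", "output", "candidate")
        }

    def _gate(self, name, features, activation):
        return [activation(sum(w * f for w, f in zip(row, features)))
                for row in self.weights[name]]

    def step(self, x, h_prev, c_prev):
        features = list(x) + list(h_prev) + [1.0]
        i = self._gate("input", features, sigmoid)
        f = self._gate("forget", features, sigmoid)
        o = self._gate("output", features, sigmoid)
        g = self._gate("candidate", features, math.tanh)
        # New cell state mixes the old cell state with the candidate values.
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def encode(self, word_vectors):
        """Return one hidden vector per input word vector."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        hidden_vectors = []
        for x in word_vectors:
            h, c = self.step(x, h, c)
            hidden_vectors.append(h)
        return hidden_vectors


encoder = TinyLSTM(input_size=3, hidden_size=4)
word_vectors = [[0.2, -0.1, 0.7], [0.0, 0.3, -0.4], [0.9, 0.1, 0.5]]
hidden_vectors = encoder.encode(word_vectors)
```

Each hidden vector depends on the current word vector and on everything that came before it, which is why the hidden vector for a word is a contextualized version of its word vector.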

Bidirectional encoders

These hidden vectors do not incorporate information from words that appear later in the sequence, but this is easily remedied. We can run an LSTM backwards to get some backwards output vectors, and we can concatenate these with the output vectors from the forward LSTM to get a more useful hidden vector. We treat this pair of forward and backward LSTMs as a unit, and it is typically referred to as a bidirectional LSTM. It takes in a sequence of word vectors, runs a forward and a backward LSTM, concatenates outputs corresponding to the same input, and returns the resulting sequence of hidden vectors.

Figure 7: A bidirectional encoder incorporates information that precedes and follows each word.
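The forward-backward-concatenate pattern can be sketched as follows. To keep the example short, a simple recurrence stands in for the LSTM, and the same stand-in is used in both directions; a real bidirectional LSTM would use two separately trained LSTMs. The function names are hypothetical.

```python
import math


def run_recurrent(word_vectors, decay=0.5):
    """Stand-in for an LSTM: h_t = tanh(decay * h_{t-1} + x_t), elementwise."""
    h = [0.0] * len(word_vectors[0])
    outputs = []
    for x in word_vectors:
        h = [math.tanh(decay * hk + xk) for hk, xk in zip(h, x)]
        outputs.append(h)
    return outputs


def bidirectional(word_vectors):
    """Concatenate forward and backward outputs position by position."""
    forward = run_recurrent(word_vectors)
    # Run over the reversed sequence, then flip back so positions line up.
    backward = run_recurrent(word_vectors[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


word_vectors = [[0.2, -0.1], [0.0, 0.3], [0.9, 0.1]]
hidden_vectors = bidirectional(word_vectors)
```

Each output vector is twice as wide as a single direction's output: the first half summarizes the words up to and including that position, and the second half summarizes the words from that position to the end.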

We use a stack of two bidirectional LSTMs as the encoder. The first bidirectional LSTM processes its entire sequence before passing outputs to the second.

Hidden vectors in machine translation

Just as pretrained word vectors proved to be useful representations for many NLP tasks, we looked to pretrain our encoder so that it would output generally useful hidden vectors. For this, we chose machine translation as the first training task. Machine translation training sets are much larger than those for most other NLP tasks, and the nature of the translation task seemed to have appealing properties for training a general context encoder, e.g. translation seems to require a more general sense of language understanding than tasks like text classification.


We teach the encoder how to generate useful hidden vectors by teaching it how to translate English sentences to German sentences. The encoder produces hidden vectors for the English sentence, and another neural network called the decoder references those hidden vectors as it generates the German sentence.

Just as LSTMs are the backbone of our encoder, LSTMs play an important role in the decoder as well. We use a decoder LSTM with two layers, just like the encoder. The decoder LSTM is initialized from the final states of the encoder, reads in a special German word vector to start, and generates a decoder state vector.

Figure 8: The decoder uses a unidirectional LSTM to create the decoder state from input word vectors.


The attention mechanism looks back at the hidden vectors in order to decide which part of the English sentence to translate next. It uses the state vector to determine how important each hidden vector is, and then it produces a new vector, which we will call the context-adjusted state, to record its observation.

Figure 9: The attention mechanism uses the hidden states and decoder state to produce a context-adjusted state.
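A simplified version of this step can be sketched with dot-product scores and a softmax. This is a sketch under stated assumptions rather than the exact mechanism used in our model: the learned projections are omitted, the hidden vectors and decoder state are assumed to have the same size, and the `attend` helper and its inputs are invented for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_vectors, decoder_state):
    """Weight each hidden vector by its dot product with the decoder state."""
    scores = [sum(h * s for h, s in zip(hidden, decoder_state))
              for hidden in hidden_vectors]
    weights = softmax(scores)
    # Weighted sum of the encoder's hidden vectors.
    summary = [sum(w * hidden[k] for w, hidden in zip(weights, hidden_vectors))
               for k in range(len(decoder_state))]
    # Combine the summary with the decoder state. A trained model would apply
    # a learned projection before the tanh; that projection is omitted here.
    context_adjusted = [math.tanh(v) for v in summary + decoder_state]
    return weights, context_adjusted


hidden_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]
weights, context_adjusted = attend(hidden_vectors, decoder_state)
```

The weights sum to one, and the hidden vector most aligned with the decoder state (here the first) receives the largest share of attention.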


The generator then looks at the context-adjusted state to determine which German word to output, and the context-adjusted state is passed back to the decoder so that it has an accurate sense of what it has already translated. The decoder repeats this process until it is done translating. This is a standard attentional encoder-decoder architecture for learning sequence to sequence tasks like machine translation.

Figure 10: The generator uses the context-adjusted state to select an output word.
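The generator's job can be sketched as scoring every word in the output vocabulary and picking one. The four-word vocabulary, the weight values, and the `generate_word` helper below are all invented for illustration, and the sketch simply picks the most probable word; other decoding strategies exist.

```python
import math

VOCAB = ["<eos>", "das", "ist", "gut"]  # hypothetical four-word German vocabulary

# One invented weight row per vocabulary word; a trained generator learns these.
WEIGHTS = [
    [0.1, -0.3, 0.2],
    [0.8, 0.1, -0.5],
    [-0.2, 0.9, 0.0],
    [0.3, 0.2, 0.7],
]


def generate_word(context_adjusted_state):
    """Score every vocabulary word, normalize with softmax, return the top word."""
    scores = [sum(w * s for w, s in zip(row, context_adjusted_state))
              for row in WEIGHTS]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    best = max(range(len(VOCAB)), key=lambda k: probabilities[k])
    return VOCAB[best], probabilities


word, probabilities = generate_word([0.0, 1.0, 0.0])
```

In a full decoder this function would be called once per output position, with the chosen word fed back in as the next input until an end-of-sentence token is produced.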

Context vectors from a pretrained MT-LSTM

When training is finished, we can extract the LSTM that we trained as an encoder for machine translation. We call this pretrained LSTM an MT-LSTM and use it to output hidden vectors for new sentences. When using these machine translation hidden vectors as inputs to another NLP model, we refer to them as context vectors (CoVe).

Figure 11: A general overview of how we a) train an encoder and b) reuse it as part of a new model.

Experimenting with CoVe

Our experiments explore the advantages of using pretrained MT-LSTMs to generate CoVe for text classification and question answering models, but CoVe can be used with any model that represents its inputs as a sequence of vectors.


Figure 12: A Biattentive Classification Network.

We work on two different kinds of text classification tasks. The first kind, which includes sentiment analysis and question classification, has a single input. The second kind, which only includes entailment classification, has two inputs. We use the Biattentive Classification Network (BCN) for both. If there is only one input, we copy it over, pretend there are two, and let the model know to avoid running redundant computation. It is not necessary to understand the details of the BCN to understand CoVe and the benefits of using them.

Question answering

We rely on the Dynamic Coattention Network (DCN) for question answering experiments. For experiments that analyze the effect of MT datasets on the performance of models learning other tasks, we use a slightly modified DCN, but for experiments testing the overall effectiveness of CoVe and of CoVe together with character vectors, we use the updated DCN+.

Dataset | Task | Details
SST-2 | Sentiment Classification | 2 classes, single sentences
SST-5 | Sentiment Classification | 5 classes, single sentences
IMDb | Sentiment Classification | 2 classes, multiple sentences
TREC-6 | Question Classification | 6 classes
TREC-50 | Question Classification | 50 classes
SNLI | Entailment Classification | 3 classes
SQuAD | Question Answering | open ended

Table 1: A summary of datasets and tasks in our experiments.


For each task, we experiment with the different ways we have of representing input sequences. We can represent each sequence as a sequence of randomly initialized word vectors that we train, we can use GloVe, and we can use GloVe together with CoVe. In the last case, we take the GloVe sequence, run it through a pretrained MT-LSTM to get a CoVe sequence, and concatenate each vector in the CoVe sequence with the corresponding vector in the GloVe sequence. Neither the MT-LSTM nor GloVe is trained as part of the classification or question answering models.
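The GloVe-plus-CoVe input can be sketched as a position-by-position concatenation. The `mt_lstm_stub` function below is a placeholder invented for this example so that it runs without any model weights; it is not the released MT-LSTM, and the vectors are made up.

```python
def mt_lstm_stub(glove_sequence):
    """Placeholder for the pretrained MT-LSTM: returns one vector per input.

    The real encoder is a trained two-layer bidirectional LSTM; this stub just
    computes running averages so the example runs without any model weights.
    """
    outputs = []
    running = [0.0] * len(glove_sequence[0])
    for t, vector in enumerate(glove_sequence, start=1):
        running = [r + (v - r) / t for r, v in zip(running, vector)]
        outputs.append(list(running))
    return outputs


def glove_plus_cove(glove_sequence, encoder=mt_lstm_stub):
    """Concatenate each GloVe vector with the CoVe vector at the same position."""
    cove_sequence = encoder(glove_sequence)
    return [g + c for g, c in zip(glove_sequence, cove_sequence)]


glove_sequence = [[0.2, -0.1, 0.7], [0.0, 0.3, -0.4]]
inputs = glove_plus_cove(glove_sequence)
```

The downstream model receives `inputs` in place of the plain GloVe sequence. Because the encoder is frozen, only the downstream model's own parameters would be updated during training.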

Experimental results show that including CoVe alongside GloVe always improves performance over both randomly initialized word vectors and using GloVe alone.

Figure 13: Validation performance is improved by starting with GloVe and adding CoVe.

More MT → better CoVe

Varying the amount of data used to train the MT-LSTM shows that training with a larger dataset leads to a higher quality MT-LSTM, where higher quality in this case means that using it to generate CoVe tended to yield better performance on the classification and question answering tasks.

Results show that the gains from using CoVe are less pronounced when the MT-LSTM is trained with less MT data, and in some cases training the MT-LSTM on these small MT datasets yields CoVe that actually hurt performance. This might suggest that the benefits of using CoVe come from using a non-trivial MT-LSTM. It might also suggest that the domain of the MT training set influences which tasks the resulting MT-LSTM will benefit.

Figure 14: Training set size for the MT-LSTM has a noticeable influence on the validation performance of models using CoVe. Here, MT-Small is the 2016 WMT multimodal dataset, MT-Medium is the 2016 IWSLT training set, and MT-Large is the 2017 WMT news track training set.

CoVe and characters

In these experiments, we try adding character vectors to GloVe and CoVe. Results show that for some tasks, the character vectors can work with GloVe and CoVe to yield even greater performance. This suggests that CoVe adds information that is complementary to character- and word-level information.

Figure 15: CoVe is complementary with the character-level information stored in character vectors.

Test performance

All of our best models used GloVe, CoVe, and character vectors. We took the model that achieved the highest validation performance for each task and evaluated it on the test set. The charts above show that adding CoVe always boosts the performance of our models over our starting point, and the table below shows that this was enough of a boost to push our starting model to new state-of-the-art performance on the test sets of three out of seven tasks.

Task | Prior State of the Art | Ours
SST-2 | 91.8 (Radford et al., 2017) | 90.3
SST-5 | 53.1 (Munkhdalai and Yu, 2016b) | 53.7
IMDb | 94.1 (Miyato et al., 2017) | 91.8
TREC-6 | 96.1 (Zhou et al., 2016) | 95.8
TREC-50 | 91.6 (Van-Tu and Anh-Cuong, 2016) | 90.2
SNLI | 88.0 (Chen et al., 2016) | 88.1
SQuAD | 82.5 (Wang et al., 2017) | 82.8

Table 2: Test performance comparison to other machine learning approaches at time of testing (7/12/17).

It is interesting to note that, just as we use machine translation data to improve our models, the state-of-the-art models for SST-2 and IMDb also use data outside the supervised training sets. For SST-2, the top model makes use of 82 million unlabeled Amazon reviews, and the top model for IMDb uses 50,000 unlabeled IMDb reviews in addition to the 22,500 supervised training examples. Both of these approaches augment with data that is much more similar to the target task than are the machine translation datasets we used. The superiority of those models might highlight the connection between the kind of additional data and the extent to which that additional data will be beneficial.


We showed how training a neural network to translate enables it to learn representations of words in context, and we showed that we can use part of that network, the MT-LSTM, to help networks learning other tasks in NLP. The context vectors, or CoVe, that the MT-LSTM provides to classification and question answering models propel them to better performance. The more data we use to train the MT-LSTM, the more pronounced the improvement, which seems to be complementary to improvements that come from using other forms of pretrained vector representations. By combining the information from GloVe, CoVe, and character vectors, we were able to boost the performance of our baseline models on a variety of NLP tasks.

Code release

We hope that making our best MT-LSTM (the one that we used to generate CoVe for all of our best models) available will encourage further exploration of reusable representations in NLP. The code includes an example of how to generate CoVe in PyTorch.

Citation credit

If you would like to dive further into the details, or if you end up using this post or the associated code in published work, please cite:

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
Learned in Translation: Contextualized Word Vectors