Multilingual Sentence Embeddings for Zero-Shot Transfer – Applying a Single Model on 93 Languages

Language models and transfer learning have become cornerstones of NLP in recent years. Phenomenal results were achieved by first building a model of words or even characters, and then using that model to solve other tasks such as sentiment analysis and question answering.

While most models are built for a single language, or for several languages separately, a new paper – Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond – presents a different approach. It uses a single sentence encoder that supports over 90 languages, trained on a dataset containing sentences from all of them. The encoder can then be applied to a given task by training only the target model (e.g. a classifier), and on a single language, a technique named Zero-Shot transfer.

This universal language model achieves strong results across most languages in tasks such as Natural Language Inference (classifying the relationship between two sentences), results that are state-of-the-art among Zero-Shot models. The technique is faster to train and has the potential to support hundreds of languages with limited training resources.


Most predictive algorithms in Natural Language Processing (NLP) can't process raw text directly, as it's non-numeric and unstructured. A popular way to overcome this is to build a language model in which characters, words or sentences are mapped to meaningful vectors, i.e. embedding vectors. The embeddings can then be fed to a prediction model, either as constant input features or by combining the two models (language and prediction) and fine-tuning them for the task.
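As a toy illustration of the feature-based setup, the sketch below stands in for a real encoder with a hypothetical hash-based `encode` function; the only point is that each sentence becomes a fixed-size numeric vector that any downstream predictor can consume:

```python
import numpy as np

def encode(sentence):
    # Hypothetical stand-in for a real sentence encoder: words are
    # hashed into a fixed-size count vector just to make this runnable.
    vec = np.zeros(16)
    for word in sentence.lower().split():
        vec[hash(word) % 16] += 1.0
    return vec

# The embedding is a constant feature vector fed to any predictor.
features = encode("language models turn text into numbers")
print(features.shape)  # (16,)
```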

In most models, every supported language requires an additional language model as well as additional training for every task. Such models tend to be data-hungry and require huge datasets, sometimes with billions of words. A different approach is to first train a single language model on all languages together. Then, for any given task, it is sufficient to train on a dataset in a single language to obtain good results in all languages, as the model generalizes across languages and applies the learned knowledge to them as well. The advantages of this technique, named Zero-Shot, are simplicity (one model for all) and efficiency, especially faster training.


To train the language model, the authors created a comprehensive dataset that consists of sentences and their translations into other languages. The dataset is based on multiple sources:

  • Professional translations – the European Union dataset (Europarl) for 21 languages, the United Nations dataset in 6 languages and Quran translations in 42 languages
  • User-generated translations – OpenSubtitles with movie subtitles in 57 languages, and Tatoeba, a community-based dataset with English sentences translated to hundreds of languages.

The final dataset includes 223 million parallel sentences in 93 languages from 34 language families (e.g. Germanic and Semitic) and 28 scripts (from Latin to Hebrew).

How it works

The approach includes two parts – training the language model, and its Zero-Shot utilization for several NLP tasks.

Language model training

The language model uses a standard architecture for machine translation with an encoder, which generates a vector representation for a sentence in one language, and a decoder that tries to translate the sentence vector to the target language. A key feature of the model is its use of a single network for all languages.

The encoder includes two parts:

  1. Byte-Pair Encoding (BPE) – An algorithm that pre-processes the entire dataset and generates a dictionary of its most frequent character sequences. The BPE module in the encoder converts the input sentence into subwords from the pre-built dictionary. When working with numerous languages, BPE significantly decreases the vocabulary size and increases the overlap (shared subwords) between languages. See Appendix A for more details.
  2. LSTM layers – A standard recurrent neural network with five layers of LSTM modules (of size 512) that generates the sentence embedding by max-pooling over the last layer. Each sentence is also processed in reversed order, i.e. a bidirectional LSTM, and the final sentence embedding is the concatenation of both directions.
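The encoder described above can be sketched roughly in PyTorch. This is not the authors' implementation: the vocabulary size, BPE embedding dimension and random input below are illustrative placeholders (real inputs would be BPE subword IDs from the 50k dictionary).

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Sketch of the encoder: BiLSTM over subword embeddings,
    max-pooled over time into a fixed-size sentence vector."""
    def __init__(self, vocab_size=1000, embed_dim=320, hidden=512, layers=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))  # (batch, seq, 2 * hidden)
        return out.max(dim=1).values               # max-pool over time

enc = SentenceEncoder()
emb = enc(torch.randint(0, 1000, (2, 7)))  # 2 sentences, 7 subword ids each
print(emb.shape)  # torch.Size([2, 1024])
```

Since both LSTM directions (512 each) are concatenated, the resulting sentence embedding has 1024 dimensions regardless of sentence length.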

The decoder then iteratively predicts the next word based on the sentence embedding, the previous output of the LSTM module (embedded with BPE) and the target language ID. The decoder has only one LSTM layer (of size 2048), followed by a softmax layer that predicts the most probable word. Its sole purpose is to train the encoder, and it isn't used afterward.

Model architecture (source: Artetxe et al.)

To avoid a quadratic cost as the number of learned languages increases, the model is trained with only two target languages (English and Spanish), rather than all-vs-all.

Per-task training

The trained encoder can be used for solving other NLP tasks:

  1. Natural Language Inference (NLI) – Deciding whether the relationship between two sentences, a premise (p) and a hypothesis (h), is entailment, contradiction or neutral. The two sentence embeddings are combined as (p, h, p·h, |p−h|) and used as input to train a small two-layer neural network that predicts the relationship.
  2. Topic classification – Classifying short texts (such as news articles) into a given list of topics, by embedding a text with the encoder and training a single-layer network (with 10 units, followed by a softmax).
  3. Similar sentence identification – An even simpler usage is finding the translation of a given sentence in a dataset of sentences in another language. This task only requires encoding all sentences and computing the distance between them with cosine similarity or a more sophisticated metric (as proposed in the paper).
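As a small sketch of the first and third tasks, the NLI feature combination (p, h, p·h, |p−h|) and plain cosine similarity look as follows (toy 4-dimensional embeddings for readability; the real sentence embeddings are 1024-dimensional):

```python
import numpy as np

def nli_features(p, h):
    """Combine premise and hypothesis embeddings as in the paper:
    (p, h, p*h, |p - h|), concatenated into one feature vector."""
    return np.concatenate([p, h, p * h, np.abs(p - h)])

def cosine_similarity(a, b):
    # Similar sentences (e.g. a sentence and its translation) should
    # have embeddings that point in nearly the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

p = np.array([0.2, 0.5, -0.1, 0.8])  # toy premise embedding
h = np.array([0.1, 0.4, -0.2, 0.9])  # toy hypothesis embedding
print(nli_features(p, h).shape)      # (16,)
print(cosine_similarity(p, h))       # close to 1.0 for similar vectors
```

The concatenated NLI vector is 4 times the embedding size, so with 1024-dimensional sentence embeddings the small classifier receives 4096 input features.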

The task model is trained only on sentences in English and then tested on all languages (Zero-Shot). In addition, the encoder is kept fixed and is not fine-tuned for each task.


The paper presents the model's results on the XNLI dataset, which includes sentences in 14 languages, for the NLI task. The model achieves state-of-the-art results compared to other models trained in a Zero-Shot setting. For example, it outperforms Zero-Shot BERT in most languages.

Comparison of XNLI accuracy between the proposed model and three other Zero-Shot models, BERT among them. All models were trained only on English sentences and tested on up to 14 languages. (Source: Artetxe et al.)

On the other hand, when BERT is trained for each language separately (by translating the training data into the target language), its results are superior to those of the proposed model in its Zero-Shot configuration.

In addition, the model achieves state-of-the-art results in topic classification on the MLDoc dataset (Reuters news articles) and in similar sentence identification on the BUCC dataset. Its consistent results across many languages show the robustness of the language model and how low-resource languages benefit from a single multilingual model.

Note: Some tasks (BUCC, MLDoc) tend to perform better when the encoder is trained on long and formal sentences, whereas other tasks (XNLI, Tatoeba) benefit from training on shorter and more informal sentences.

Compute & Implementation

The model was implemented in PyTorch using fairseq for the encoder and the decoder, and will be open-sourced. The language model was trained on 16 NVIDIA V100 GPUs for about 5 days.


Multilingual sentence embeddings present a novel technique for creating language models that is faster, simpler and more scalable. The model can easily be fitted to new languages and new tasks while achieving strong results across many languages. It can even work with “unknown” languages that aren’t part of the language model.

However, when efficiency and scalability are less important than accuracy, it seems that this model, and Zero-Shot models in general, are inferior to fine-tuned models such as BERT. Interestingly, the authors plan to borrow concepts from the BERT architecture, e.g. using its Transformer instead of the BiLSTM module, to improve their model. A different approach would be to fine-tune the model for specific languages and compare the results.

Special thanks to Mikel Artetxe, one of the paper’s authors, for his insights on the workings of the model.


Appendix A – BPE

Byte Pair Encoding (BPE) is a data compression technique that iteratively replaces the most frequent pair of symbols (originally bytes) in a given dataset with a single unused symbol. In each iteration, the algorithm finds the most frequent adjacent pair of symbols (each of which can be a single character or a sequence of characters) and merges them into a new symbol. All occurrences of the selected pair are then replaced with the new symbol before the next iteration. Eventually, frequent sequences of characters, up to whole words, are replaced by single symbols, until the algorithm reaches the defined number of iterations (50k in this paper). During inference, a word that isn’t part of the BPE’s pre-built dictionary is split into subwords that are.
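The merge loop described above can be sketched in a few lines of Python (with a toy three-word corpus, not the paper's 50k-merge setup):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Minimal BPE sketch: repeatedly merge the most frequent adjacent
    symbol pair. `words` maps a word (tuple of symbols) to its count."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the new symbol.
        new_vocab = {}
        for word, count in vocab.items():
            w, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    w.append(merged)
                    i += 2
                else:
                    w.append(word[i])
                    i += 1
            new_vocab[tuple(w)] = count
        vocab = new_vocab
    return merges, vocab

corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
merges, vocab = bpe_merges(corpus, 3)
print(merges)  # [('w', 'e'), ('we', 'r'), ('l', 'o')]
```

Here "we" is merged first (it appears 13 times across the weighted corpus), then "we"+"r", then "l"+"o", illustrating how frequent character sequences grow into reusable subwords.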
