XLM – Enhancing BERT for Cross-lingual Language Model

Attention models, and BERT in particular, have achieved promising results in Natural Language Processing, in both classification and translation tasks. A new paper by Facebook AI, named XLM, presents an improved version of BERT to achieve state-of-the-art results in both types of tasks.

XLM uses a known pre-processing technique (BPE) and a dual-language training mechanism with BERT in order to learn relations between words in different languages. The model outperforms other models in a cross-lingual classification task (sentence entailment in 15 languages) and significantly improves machine translation when a pre-trained model is used for initialization of the translation model.  


XLM is based on several key concepts:

Transformers, invented in 2017, introduced an attention mechanism that processes the entire text input simultaneously to learn contextual relations between words (or sub-words). A Transformer includes two parts: an encoder that reads the text input and generates a latent representation of it (e.g. a vector for each word), and a decoder that produces the translated text from that representation. A great in-depth review of Transformers can be found here.
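The attention step described above can be sketched in a few lines. This is a minimal, single-head illustration of scaled dot-product attention on plain Python lists; the function names and toy vectors are assumptions made for the example, and a real Transformer adds learned projections, multiple heads and batching.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query attends to all keys at once."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy input: two identical keys, so the query weights both values equally.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
context = attention([[1.0, 1.0]], keys, values)
print(context)
```

Because every query is scored against every key, each output vector can draw on the whole input sequence rather than only on nearby words.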

While the vanilla Transformer, when used as a language model, has only limited context for each word, i.e. only its predecessors, in 2018 the BERT model took it one step forward. It uses the Transformer’s encoder to learn a language model by masking (dropping) some of the words and then trying to predict them, allowing it to use the entire context, i.e. words to the left and right of a masked word.
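The masking step can be sketched as follows. This is a simplified illustration: it always substitutes a `[MASK]` token, whereas BERT's actual recipe selects about 15% of the tokens and sometimes keeps or randomly replaces them instead. The function name `mask_tokens`, the masking rate used in the demo and the seed are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            targets[i] = tok      # remember what it has to predict
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The model sees `masked` in full, so the prediction for each hidden position can use the words on both sides of it.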

Due to the concurrent processing of all tokens in the attention module, the model needs more information about the position of each token. By adding a fixed value to each token based on its position (e.g. sinusoidal function) – a step named Positional Encoding – the network can successfully learn relations between tokens. Our summary of BERT can be found here.
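As a rough illustration of Positional Encoding, the sketch below builds the sinusoidal table from the original Transformer paper and adds one row to a token vector. The function name and the toy dimensions are assumptions for the example; note that BERT itself learns its position embeddings rather than using this fixed formula.

```python
import math

def positional_encoding(num_positions, dim):
    """Sinusoidal table: even indices use sin, odd indices use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, dim=8)

# The encoding is simply added to the token embedding at that position.
token_embedding = [0.1] * 8
with_position = [e + p for e, p in zip(token_embedding, pe[2])]
print(with_position)
```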

In 2018, Lample et al. presented a translation model that combines Transformers and a statistical phrase-based model (PBSMT). The latter relies on a table of probabilities for pairs of phrases in different languages. An important concept in the paper is Back-Translation, in which a sentence is translated to the target language and back to the source. This concept enables using monolingual datasets, which are bigger and more common than bilingual datasets, in a supervised manner. One of the conclusions of Lample et al. is that the initialization of the token embeddings is of high importance for the success of the model, especially when using Back-Translation. While the authors used “simple” FastText word embeddings, they suggest that “more powerful language models may further improve our results”.

How XLM works

The paper presents two innovative ideas – a new training technique of BERT for multilingual classification tasks and the use of BERT as initialization of machine translation models.

Cross-lingual BERT for classification

Though BERT was trained on over 100 languages, it wasn’t optimized for multilingual models: most of the vocabulary isn’t shared between languages and therefore the shared knowledge is limited. To overcome that, XLM modifies BERT in the following way:

First, instead of using words or characters as the input of the model, it uses Byte-Pair Encoding (BPE), which splits the input into the most common sub-words across all languages, thereby increasing the shared vocabulary between languages. This is a common pre-processing algorithm and a summary of it can be found here.
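To make the idea concrete, here is a minimal sketch of how BPE merges can be learned: start from characters and repeatedly merge the most frequent adjacent pair. The toy word counts, the `</w>` end-of-word marker and the function name `learn_bpe` are assumptions for this example, not the tooling used in the paper.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict; returns (merges, vocab)."""
    # Start with every word split into characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Toy joint corpus mixing English and French spellings of related words.
corpus = {"national": 5, "nationale": 4, "international": 3}
merges, vocab = learn_bpe(corpus, num_merges=5)
print(merges)
print(list(vocab))
```

Because the merges are learned on text from all languages together, sub-words that recur across languages (as in the toy corpus here) end up as shared vocabulary entries.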

Second, it upgrades the BERT architecture in two manners:

  1. Each training sample consists of the same text in two languages, whereas in BERT each sample is built from a single language. As in BERT, the goal of the model is to predict the masked tokens. With the new architecture, however, the model can use the context from one language to predict tokens in the other, as different words are masked in each language (they are chosen randomly).
  2. The model also receives the language ID and the order of the tokens in each language, i.e. the Positional Encoding, separately. The new metadata helps the model learn the relationship between related tokens in different languages.
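The two changes above can be sketched as a function that builds one dual-language training sample. This is an illustrative simplification: the single `</s>` separator, the `[MASK]` symbol, the masking rate and the name `build_tlm_example` are assumptions for the example, not the paper's exact input format.

```python
import random

MASK = "[MASK]"
SEP = "</s>"

def build_tlm_example(src_tokens, tgt_tokens, src_lang, tgt_lang, mask_prob=0.15, seed=0):
    """Concatenate a sentence pair and attach per-token position and language IDs."""
    rng = random.Random(seed)
    tokens = src_tokens + [SEP] + tgt_tokens
    # Positions restart at 0 for the second sentence.
    positions = list(range(len(src_tokens) + 1)) + list(range(len(tgt_tokens)))
    # The separator is assigned to the first language (an arbitrary choice here).
    langs = [src_lang] * (len(src_tokens) + 1) + [tgt_lang] * len(tgt_tokens)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        # Tokens are masked at random in both languages; the separator is kept.
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            targets[i] = tok
        else:
            inputs.append(tok)
    return {"tokens": inputs, "positions": positions, "langs": langs, "targets": targets}

example = build_tlm_example(["the", "cat"], ["le", "chat"], "en", "fr", mask_prob=0.5, seed=3)
for key, value in example.items():
    print(key, value)
```

When a word is masked in one language but its translation is visible in the other, the model can use the translation as context for the prediction.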

The upgraded BERT is denoted as Translation Language Modeling (TLM) while the “vanilla” BERT with BPE inputs is denoted as Masked Language Modeling (MLM).

The complete model was trained by training both MLM and TLM and alternating between them.
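A hedged sketch of such an alternating schedule is shown below. It assumes strict step-by-step alternation and non-empty batch lists, and the batch contents are placeholders; the paper's actual sampling of languages and batches is not reproduced here.

```python
from itertools import cycle, islice

def alternate_objectives(mono_batches, parallel_batches):
    """Yield (objective, batch) pairs, switching between MLM and TLM every step."""
    # Both lists are assumed to be non-empty; they are repeated indefinitely.
    mono, para = cycle(mono_batches), cycle(parallel_batches)
    while True:
        yield "MLM", next(mono)   # single-language batch
        yield "TLM", next(para)   # parallel sentence-pair batch

schedule = list(islice(alternate_objectives(["en-1", "fr-1"], [("en-a", "fr-a")]), 4))
print(schedule)
```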

Comparison of a single language modeling (MLM) similar to BERT, and the proposed dual-language modeling (TLM). Source: XLM

To assess the contribution of the model, the paper presents its results on a sentence entailment task (classifying the relationship between two sentences) using the XNLI dataset, which includes sentences in 15 languages. The model significantly outperforms other prominent models, such as Artetxe et al. and BERT, in all configurations: training only on English and testing on all languages (Zero-Shot), training on English data machine-translated into each target language (Translate-Train), and training on English and testing on data translated into English (Translate-Test). These results are considered state-of-the-art.

Comparison of XNLI results (accuracy) of prominent models in different training and testing configurations. Each column represents a language. Source: XLM

Initialization of translation models with MLM

The paper presents another contribution of BERT, and more precisely of the MLM model: a better initialization technique for the translation model of Lample et al. Instead of only initializing the token embeddings with FastText vectors, the translation model is initialized from a pretrained MLM.

By using the pretrained model to initialize both the encoder and the decoder of the translation model (which uses Transformer), the translation quality improves by up to 7 BLEU as shown in the table below.
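A minimal sketch of this initialization step, using plain dictionaries as stand-ins for model parameters. All parameter names (`embeddings`, `layer0.self_attn`, `layer0.cross_attn`) and the helper `init_from_pretrained` are hypothetical; the assumption illustrated is that parameters with a pretrained counterpart are copied, while the rest keep their random initialization.

```python
import copy

def init_from_pretrained(model_params, pretrained_params):
    """Copy every pretrained entry whose name exists in the model; leave the rest as is."""
    initialized = dict(model_params)
    loaded = []
    for name, value in pretrained_params.items():
        if name in initialized:
            initialized[name] = copy.deepcopy(value)
            loaded.append(name)
    return initialized, loaded

# Toy pretrained language model (hypothetical names and values).
pretrained_mlm = {"embeddings": [[0.5, -0.2], [0.1, 0.9]], "layer0.self_attn": [[1.0, 0.0]]}

# Randomly initialized translation model parts.
encoder = {"embeddings": [[0.0, 0.0], [0.0, 0.0]], "layer0.self_attn": [[0.0, 0.0]]}
decoder = {
    "embeddings": [[0.0, 0.0], [0.0, 0.0]],
    "layer0.self_attn": [[0.0, 0.0]],
    "layer0.cross_attn": [[0.3, 0.3]],  # no counterpart in the pretrained model
}

encoder, enc_loaded = init_from_pretrained(encoder, pretrained_mlm)
decoder, dec_loaded = init_from_pretrained(decoder, pretrained_mlm)
print(enc_loaded, dec_loaded)
```

The same pretrained parameters seed both sides of the translation model, so the encoder and the decoder start from a shared representation of the tokens.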

Translation results with different initialization techniques. CLM stands for Causal Language Modeling in which a given word is trained based only on the previous words and not using the masking technique. Source: XLM

Note: The paper also shows that training a cross-lingual language model can be very beneficial for low-resource languages, as they can leverage data from other languages, especially similar ones, mainly due to the BPE pre-processing. This conclusion is similar to the one from Artetxe et al. (our summary can be found here).

Compute considerations

The models are implemented in PyTorch and can be found here, including pretrained models. The training was done with 64 Volta GPUs for the language modeling tasks and 8 GPUs for the translation tasks, though the duration isn’t specified. Exact implementation details can be found in sections 5.1 and 5.2 of the paper.


As in many recent studies, the paper shows the power of language models and transfer learning, and BERT in particular, to improve performance in many NLP tasks. With simple but smart tweaks to BERT, the model outperforms other cross-lingual classification models and significantly improves translation models.

Interestingly, the translation model used in the paper and the MLM model that was used for initialization are both based on Transformer. It’d be safe to assume that we’ll see more combinations of this kind, such as using the new Transformer-XL for initialization.
