HMTL – Multi-task Learning for solving NLP Tasks

The field of Natural Language Processing includes dozens of tasks, among them machine translation, named-entity recognition, and entity detection. While the different NLP tasks are often trained and evaluated separately, there exists a potential advantage in combining them into one model, i.e., learning one task might be helpful in learning another task and improve its results.

The Hierarchical Multi-Task Learning (HMTL) model provides an approach to learning different NLP tasks by training on the “simple” tasks first and using that knowledge to train on more complicated tasks. The paper reports state-of-the-art performance on several tasks and provides an in-depth analysis of the importance of each part of the model, from different aspects of the word embeddings to the order of the tasks.


Several papers from recent years showed that combining multiple NLP tasks can generate better and deeper representations of text. For example, identifying entities in a sentence, such as names of locations or people, can help with finding mentions of them in subsequent sentences. However, not all NLP tasks are related, and it’s essential to select tasks that are relevant to one another.

The HMTL model focuses on four different tasks: Named Entity Recognition, Entity Mention Detection, Coreference Resolution, and Relation Extraction.

  1. Named Entity Recognition (NER) – Identify types of entities in text (e.g. Person, Organization, Location, etc.)
  2. Entity Mention Detection (EMD) – An extended version of NER that identifies any mention of a real-life entity, even if it’s not a name.
  3. Coreference Resolution (CR) – Identifying and grouping mentions of the same entity.
  4. Relation Extraction (RE) – Identifying entities and classifying the type of relation between them (if one exists). The types of relations can be found here. Due to the semantic similarity between RE and CR, they are both on the same hierarchical level.

The following text illustrates the difference between the tasks (a great demo can be found here):
When we were in Spain, my mom taught me how to drive with a car. She also explained how to fuel it.

EMD: (PERSON, my mom), (VEHICLE, car), (PERSON, we), (PERSON, She)
CR: (My mom, She), (a car, it), (My, me)
RE: (PHYS, We, Spain), (PER-SOC, My, mom), (ART, my mom, a car)

All four tasks are related to identifying entities in a text and the relations between them, with different levels of complexity – while NER is the simplest one, CR and RE require a deeper understanding of the text. Therefore, learning one task might help learn others.

The HMTL Model

HMTL is a hierarchical model in which simpler tasks, such as NER, are learned first, and their results are then used to train the following tasks. Each task is built from three components: word embeddings, an encoder, and a task-specific layer.

The base of the model is the word representation that embeds each word from the input sentence into a vector using three models:

  1. GloVe – Pre-trained word embeddings. Words in this model have no context and a given word will always be represented by the same vector.
  2. ELMo – Pre-trained contextual word embeddings. A vector representation of a word depends not only on the word itself, but also on the rest of the words in the sentence. ELMo is one of the top performing models in the GLUE Benchmark.
  3. Character-level word embeddings – A convolutional neural network that learns to represent words based on character-level features. This kind of representation is more sensitive to morphological features (prefixes, suffixes, etc.), which are important in understanding relations between entities.
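
To make the character-level component more concrete, here is a minimal sketch of the general idea: embed each character, slide convolution filters over character windows, and keep the strongest response per filter (max-over-time pooling). The function name, dimensions, zero padding, and random untrained weights are all assumptions for illustration; they are not taken from the paper.

```python
import random


def char_cnn_embedding(word, char_dim=4, num_filters=3, width=3, seed=0):
    """Toy character-level CNN: embed chars, convolve, max-pool over positions.

    All weights are random and untrained (hypothetical values); a real model
    would learn them together with the rest of the network.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    char_vecs = {c: [rng.uniform(-1, 1) for _ in range(char_dim)] for c in alphabet}
    filters = [
        [[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(width)]
        for _ in range(num_filters)
    ]

    # Look up a vector for every known character (case-insensitive).
    chars = [char_vecs[c] for c in word.lower() if c in char_vecs]
    # Pad with zero vectors so that short words still yield at least one window.
    pad = [[0.0] * char_dim] * (width - 1)
    chars = pad + chars + pad

    embedding = []
    for f in filters:
        # Slide the filter over every window of `width` consecutive characters.
        activations = [
            sum(f[k][d] * chars[i + k][d] for k in range(width) for d in range(char_dim))
            for i in range(len(chars) - width + 1)
        ]
        # Max-over-time pooling keeps the strongest response of this filter.
        embedding.append(max(activations))
    return embedding


vector = char_cnn_embedding("driving")  # one value per filter
```

Because each filter only sees a few neighbouring characters, a filter that responds to a suffix such as "ing" fires for every word containing it, which is one way to see why this representation picks up morphological cues.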

In addition, each task is trained with a dedicated encoder – a multi-layer recurrent neural network that generates word embeddings tailored for the task. The encoder is implemented as a bidirectional GRU network, and its output is a concatenation of the last layer of the forward and backward networks. The input of the encoder consists of the base word representation and the output of the previous task’s encoder (when available).
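
The wiring described above can be sketched in a few lines. This is only an illustration of the data flow, not the authors' implementation: `running_mean` is a deliberately simple stand-in for a GRU (it is not a GRU), and all function names and vector sizes are hypothetical.

```python
def encoder_input(base_repr, prev_encoder_out=None):
    """Per token, concatenate the base word representation with the output
    of the previous task's encoder (when there is one)."""
    if prev_encoder_out is None:
        return [list(tok) for tok in base_repr]
    return [list(b) + list(p) for b, p in zip(base_repr, prev_encoder_out)]


def running_mean(seq):
    """Toy recurrent state: mean of all vectors seen so far.
    This is a stand-in for a GRU, chosen only to keep the sketch short."""
    out, total = [], [0.0] * len(seq[0])
    for n, vec in enumerate(seq, start=1):
        total = [t + v for t, v in zip(total, vec)]
        out.append([t / n for t in total])
    return out


def toy_bi_encoder(inputs):
    """Run the toy recurrence left-to-right and right-to-left, then
    concatenate both directions for every token."""
    forward = running_mean(inputs)
    backward = running_mean(inputs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Three tokens with a 2-dimensional base representation (made-up numbers).
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ner_out = toy_bi_encoder(encoder_input(base))           # first level: base only
emd_out = toy_bi_encoder(encoder_input(base, ner_out))  # next level: base + NER encoder
```

In this toy version the output size grows with the input size; in a real GRU the hidden size is a fixed hyperparameter, so only the input dimension changes from one level to the next.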

On top of the encoders, each task uses a different neural network as described below:

  1. The first two levels (NER & EMD) use a Conditional Random Field that predicts the entity type of a word based on its neighbours’ types. The concept behind this algorithm is that it finds the optimal combination of entity types for all the words in a sentence together. A good explanation of this algorithm can be found here.
  2. In Coreference Resolution (CR), the model first calculates the likelihood of each sequence of words (“span”) being a mention of a preceding span; for example, a pronoun is more likely to be a mention than a verb. It then picks the top N spans and calculates a score for each pair of spans. Softmax over the candidates ensures that each span is a mention of at most one other span. A dummy token is added for cases in which no pair is found.
  3. The Relation Extraction (RE) task uses a layer that calculates the probability of each pair of tokens to match each relation type (in total, T^2 * R_types probabilities). The model uses a sigmoid function and not softmax to allow multiple relations for each token.
The HMTL model (Source: Sanh et al.)
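
The difference between the softmax used for CR and the sigmoid used for RE can be shown with a small sketch. It assumes the pair scores and relation logits are already computed; the fixed dummy score of 0, the 0.5 threshold, and the function names are assumptions for illustration rather than details confirmed by the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pick_antecedent(pair_scores, dummy_score=0.0):
    """Choose at most one antecedent for a span.

    pair_scores[k] is the score of candidate antecedent k. A dummy candidate
    is prepended so the model can answer "no antecedent" (returned as None).
    """
    probs = softmax([dummy_score] + list(pair_scores))
    best = max(range(len(probs)), key=probs.__getitem__)
    antecedent = (best - 1) if best > 0 else None
    return antecedent, probs


def predict_relations(logits, threshold=0.5):
    """logits[i][j][r] scores relation type r between tokens i and j.

    Each of the T * T * R entries gets an independent sigmoid, so a token
    pair may carry several relation types at once, or none.
    """
    return [
        (i, j, r)
        for i, row in enumerate(logits)
        for j, cell in enumerate(row)
        for r, z in enumerate(cell)
        if sigmoid(z) > threshold
    ]


# Softmax: probabilities compete, so exactly one option wins per span.
antecedent, probs = pick_antecedent([2.0, -1.0])

# Sigmoid: T = 2 tokens and R = 2 relation types give 2 * 2 * 2 = 8 scores.
logits = [[[-2.0, -2.0], [3.0, 3.0]], [[-2.0, 4.0], [-2.0, -2.0]]]
relations = predict_relations(logits)
```

With softmax the probabilities for one span sum to 1, which enforces the single-antecedent constraint; with sigmoid every entry is judged on its own, which is what permits the pair (0, 1) in the example to hold both relation types.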

One of the challenges in training a hierarchical model is catastrophic forgetting, in which training new tasks causes the model to “forget” previous tasks and achieve degraded performance on them. HMTL deals with catastrophic forgetting by randomly picking one of the previous tasks during the current task’s training (after each parameter update) and training the model on a random sample from that task’s dataset. The probability of picking a task for training isn’t uniform but proportional to the size of its dataset, a technique which the authors found to be more effective.
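
Proportional task sampling as described above is straightforward to sketch with the standard library. The dataset sizes and the function name below are made up for illustration; only the idea of weighting each task by its dataset size comes from the text.

```python
import random


def sample_task(dataset_sizes, rng=random):
    """Pick a task with probability proportional to the size of its dataset."""
    tasks = list(dataset_sizes)
    weights = [dataset_sizes[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]


# Hypothetical dataset sizes (number of training examples per task).
sizes = {"NER": 6000, "EMD": 3000, "CR": 1000}

rng = random.Random(0)  # seeded only to make the sketch reproducible
counts = {task: 0 for task in sizes}
for _ in range(10_000):
    # In a training loop, this draw would happen after each parameter update,
    # followed by one update on a random batch from the chosen task's dataset.
    counts[sample_task(sizes, rng)] += 1
```

Over many draws the counts approach the 60/30/10 split implied by the sizes, so larger datasets are visited more often while smaller tasks are still revisited regularly.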


The model was trained on several datasets for comparison, with two key datasets – OntoNotes 5.0 for NER and ACE05 for the rest of the tasks. ACE05 was used in two configurations – regular and Gold Mentions (GM), with the GM configuration consisting of two parts:

  1. The Coreference Resolution (CR) task was evaluated on gold mentions, which are annotated by humans rather than extracted automatically. These mentions are more expensive to produce and are not available for most datasets. According to the paper, using gold mentions at evaluation improves CR performance.
  2. Training the CR task with a different split of the same dataset (ACE05) used to train the RE and EMD tasks. Using a different split can help the model to learn a richer representation.


The paper claims state-of-the-art results in Entity Mention Detection (EMD) and Relation Extraction (RE) by training the full model using the Gold Mention (GM) configuration. According to the paper, using the GM configuration in training improves the F1-score of the CR task by 6 points, while it improves the EMD and RE tasks by 1-2 points.

The paper also claims to achieve state-of-the-art results in Named-Entity Recognition, although it seems that the recent BERT model reached slightly better results. However, it’s hard to compare the two since the HMTL model wasn’t fine-tuned for the dataset used by BERT. A summary of BERT can be found here.

Another interesting result from the paper is a reduction in the required training time to reach the same performance. The full model (with GM) needs less time than most single tasks – NER (-16%), EMD (-44%) and CR (-28%) – while requiring more time than RE (+78%).

A possible concern regarding the GM configuration is “information leakage”: due to the different split, records that are used for training one task might later be used as test data for another task. The knowledge regarding those records might be stored in one of the shared layers, allowing for artificially improved results.

Ablation Study

Task Combinations
To gain a deeper understanding of the hierarchical approach, the paper compares the results of different combinations of tasks without the GM configuration, as shown in the table below (F1 scores of several configurations). It appears that the contribution of multi-task training is inconclusive and depends on the task:

  1. Different tasks achieved their best results with different task combinations, meaning there is no one dominant combination.
  2. In the low-level tasks, the benefit of the hierarchical model is small (less than 0.5 F1 points).
  3. The biggest improvement was achieved in the RE task, with over 5 F1 points. A possible explanation is that the EMD task is trained before the RE task and learns to identify almost the same entities as the RE task.
Task combinations comparison

Word representation
As mentioned previously, the base of the model is the word representation, which consists of three models – GloVe, ELMo and character-level word embeddings. The selection of these models also has a significant effect on the model performance, as shown in the table below. ELMo embeddings and character-level embeddings add 2-4 points each to the F1 score of most tasks.

Comparison of Word representation (Source: Sanh et al.)


The paper presents an interesting technique of combining seemingly separate NLP tasks and techniques to achieve top results in language analysis. The results emphasize the need for further research into the field as it’s currently difficult to understand when a specific NLP task can be useful to improve results in an unrelated NLP task.

Special thanks to Victor Sanh, one of the paper’s authors, for valuable insights on the workings of HMTL.
