BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding
Abstract
BERT stands for Bidirectional Encoder Representations from Transformers.
This algorithm is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
⇒ As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
1. Introduction
⇒ Language model pre-training has been shown to be effective for improving many natural language processing tasks.
There are two existing strategies for applying pre-trained language representations to downstream tasks :
feature-based
fine-tuning
The feature-based approach uses task-specific architectures that include the pre-trained representations as additional features.
The fine-tuning approach introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.
⇒ Current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training.
BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" ( MLM ) pre-training objective.
⇒ The MLM randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
In addition to the MLM, we also use a "next sentence prediction" task that jointly pre-trains text-pair representations.
2. Related Work
3. BERT
There are two steps in our framework : pre-training and fine-tuning.
During pre-training, the model is trained on unlabeled data over different pre-training tasks.
For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.
↳ Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.
A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.
Model Architecture
BERT Base ( L = 12, H = 768, A = 12, Total Param = 110M )
BERT Large ( L = 24, H = 1024, A = 16, Total Param = 340M )
with :
L : number of layers ( Transformer blocks )
H : hidden size
A : number of self-attention heads
n.b. BERT Base was chosen to have the same model size as OpenAI GPT for comparison purposes.
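As a rough sanity check on the parameter totals above, the sketch below counts the weights of a BERT-style encoder from L and H. It is not the authors' accounting: the function name `bert_param_count` is hypothetical, and the vocabulary size (30,522, as in the released uncased WordPiece vocabulary), the 512 positions, the 2 segment types, the feed-forward width of 4H, and the inclusion of a pooling layer are all assumptions. A does not appear because the attention heads split the hidden size H without adding parameters.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (no task head).

    All default sizes are assumptions made for this sketch.
    """
    # Embeddings: token, position and segment tables, plus one LayerNorm
    # (scale and bias vectors of size `hidden`).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Self-attention: query, key, value and output projections,
    # each a hidden x hidden matrix with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> 4 * hidden -> hidden, with biases.
    feed_forward = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    # Two LayerNorms per Transformer block (scale and bias each).
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Pooler: one dense layer applied to the first token's final hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


for name, num_layers, hidden in [("BERT Base", 12, 768), ("BERT Large", 24, 1024)]:
    millions = bert_param_count(num_layers, hidden) / 1e6
    print(f"{name}: about {millions:.1f}M parameters")
```

Under these assumptions the sketch gives about 109.5M for BERT Base and about 335.1M for BERT Large, close to the rounded 110M and 340M quoted above; the small gap for BERT Large suggests the paper's figure is rounded or counted slightly differently.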
Pre-training BERT
We pre-train BERT using two unsupervised tasks :
# Task 1 : Masked LM
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
⇒ We refer to this procedure as "masked LM"
details :
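We mask 15% of all WordPiece tokens in each sequence at random ➝ we only predict the masked tokens rather than reconstructing the entire input.
Problem : This creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.
Solution : We do not always replace the chosen token with [MASK]. If the i-th token is chosen, we replace it with :
the [MASK] token 80% of the time
a random token 10% of the time
the unchanged i-th token 10% of the time
Then Ti, the final hidden vector of the i-th token, will be used to predict the original token with cross entropy loss.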
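The masking procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the 15% selection rate and the 80% / 10% / 10% replacement split follow the paper, while the function name `mask_tokens`, the toy vocabulary, and the choice to always mask at least one token are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for the masked LM objective.

    `labels` maps each chosen position to its original token; only these
    positions contribute to the prediction loss.
    """
    rng = rng or random.Random()
    # Special tokens are never chosen for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    if not candidates:
        return list(tokens), {}
    # Assumption: always choose at least one position in short sequences.
    num_to_mask = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    chosen = sorted(rng.sample(candidates, num_to_mask))

    corrupted = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
toy_vocab = ["my", "dog", "is", "hairy", "apple", "runs"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
print(corrupted, labels)
```

Because a chosen position may hold [MASK], a random token, or the original token, the model cannot tell which input tokens it will be asked to predict and has to keep a contextual representation of every token.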
# Task 2 : Next sentence prediction ( NSP )
Many important downstream tasks such as Question Answering ( QA ) and Natural Language Inference ( NLI ) are based on understanding the relationship between two sentences, which is not directly captured by language modeling.
⇒ In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
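To make "trivially generated" concrete, here is a small sketch of how NSP training pairs could be built from a corpus. In the paper, sentence B is the actual next sentence 50% of the time (label IsNext) and a random sentence from the corpus otherwise (label NotNext). The function name `make_nsp_examples`, the list-of-sentences document format, and drawing the random sentence from a different document are assumptions made for this illustration.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples for next sentence prediction.

    `documents` is a list of documents, each a list of sentence strings.
    Assumption: at least two documents are non-empty, so that a random
    sentence can always be drawn from a different document.
    """
    rng = rng or random.Random()
    if sum(1 for doc in documents if doc) < 2:
        raise ValueError("need at least two non-empty documents")

    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if rng.random() < 0.5:
                # 50%: B is the sentence that really follows A.
                sentence_b, label = doc[i + 1], "IsNext"
            else:
                # 50%: B is a random sentence taken from another document.
                others = [j for j, d in enumerate(documents) if j != doc_idx and d]
                sentence_b = rng.choice(documents[rng.choice(others)])
                label = "NotNext"
            examples.append((sentence_a, sentence_b, label))
    return examples


corpus = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
for example in make_nsp_examples(corpus, rng=random.Random(0)):
    print(example)
```

No manual labels are needed: the label comes for free from the order of sentences in the corpus.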
Pre-training data
For the pre-training corpus we use the BooksCorpus ( 800M words ) and English Wikipedia ( 2,500M words ).
Fine-tuning BERT
For applications involving text pairs, a common pattern is to encode each text of the pair independently before applying bidirectional cross-attention.
BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross-attention between two sentences.
↳ For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end to end.
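The "concatenated text pair" can be illustrated with a short sketch of the input packing. The [CLS] and [SEP] special tokens and the two segment ids follow the paper's input representation; the function name `pack_pair` and the default limit of 512 tokens are assumptions made for this example, and the tokens are taken as already split into WordPieces.

```python
def pack_pair(tokens_a, tokens_b=None, max_len=512):
    """Pack one or two token sequences into a single BERT-style input.

    Returns (tokens, segment_ids): segment id 0 marks [CLS], sentence A and
    its [SEP]; segment id 1 marks sentence B and its closing [SEP].
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    if len(tokens) > max_len:
        raise ValueError(f"packed input has {len(tokens)} tokens, limit is {max_len}")
    return tokens, segment_ids


tokens, segment_ids = pack_pair(["how", "are", "you"], ["fine", "thanks"])
print(tokens)
print(segment_ids)
```

Since both sentences sit in one sequence, every self-attention layer lets each token of sentence A attend to each token of sentence B and vice versa, which is what replaces a separate cross-attention stage. Single-sentence tasks use the same function with `tokens_b` left out.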
Conclusion
Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures. This paper generalizes these findings to deep bidirectional architectures.