16 BERT Pre-training

BERT training is done in two stages: pre-training and fine-tuning. This section looks at pre-training.

Before getting to pre-training, though, I want to cover the input/output representations. The input representation was chosen so that the model could flexibly handle a range of NLP tasks: a single token sequence can unambiguously represent either a single sentence or a pair of sentences joined by a separator. This matters for a language model because it needs to be able to reason about the connection between sentences.

Every sequence starts with the special classification token [CLS]. The final hidden vector for this token is used downstream as the aggregate sequence representation for classification tasks (e.g. deciding whether the second sentence is the correct next sentence or not).

Each sequence is tokenized and embedded using WordPiece embeddings with a 30k-token vocabulary. Sentence pairs are packed into a single sequence, and the two sentences are distinguished in two ways: first, they are separated by a [SEP] token; second, a learned embedding is added to every token indicating whether it belongs to sentence A or sentence B.
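The packing described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name build_bert_input is hypothetical, the inputs are assumed to be already split into WordPiece tokens, and ending a single sentence with [SEP] as well is an assumption based on common practice.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Pack one or two tokenized sentences into a single sequence.

    Returns the token list and a parallel list of segment ids
    (0 for sentence A, 1 for sentence B). The segment ids are what the
    learned A/B embedding would be looked up from.
    """
    # Sentence A, wrapped in [CLS] ... [SEP], all in segment 0.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    if tokens_b is not None:
        # Sentence B plus its closing [SEP], all in segment 1.
        part_b = list(tokens_b) + ["[SEP]"]
        tokens += part_b
        segment_ids += [1] * len(part_b)

    return tokens, segment_ids


tokens, segment_ids = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
print(tokens)
print(segment_ids)
```

In a real model the token ids and segment ids would each index an embedding table, and the resulting vectors would be summed per position.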

BERT’s input representation is shown in Fig2 below.


Task #1: Masked LM

BERT’s Transformer encoder processes all tokens of a sequence in parallel, so every position can attend to context on both its left and its right, which makes it a natural fit for predicting masked tokens. BERT is therefore trained to produce a deep bidirectional representation by simply masking some percentage (15%) of the input tokens at random and then learning to predict those masked tokens. This procedure is referred to in the literature as a Cloze task, but here they call it a “masked LM” (MLM). The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary.
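To make the last step concrete, here is a rough pure-Python sketch of a softmax over the vocabulary with a cross-entropy loss computed only at the masked positions. The names softmax and mlm_loss are hypothetical, and the logits are assumed to come from some projection of the final hidden vectors onto the vocabulary, which is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_loss(logits_per_position, labels):
    """Average cross-entropy over the masked positions only.

    logits_per_position: dict mapping position -> list of vocab logits
    labels: dict mapping masked position -> id of the original token
    Positions that were not selected for prediction contribute nothing.
    """
    losses = []
    for pos, target_id in labels.items():
        probs = softmax(logits_per_position[pos])
        losses.append(-math.log(probs[target_id]))
    return sum(losses) / len(losses)


# Toy example: a 4-token vocabulary and a single masked position (3).
toy_logits = {3: [0.0, 0.0, 0.0, 0.0]}
toy_labels = {3: 2}
print(mlm_loss(toy_logits, toy_labels))
```

With uniform logits the model is guessing at random, so the loss equals the log of the vocabulary size.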

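Although the [MASK] token allows for a bidirectional pre-trained model, a downside is the mismatch between pre-training and fine-tuning: the [MASK] token does not appear in normal language, so it is never seen during fine-tuning. To mitigate this, a token selected for prediction is not always replaced with [MASK]: 80% of the time it is replaced with [MASK], 10% of the time with a random token, and 10% of the time it is left unchanged.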
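The masking procedure can be sketched as follows, using the 80% [MASK] / 10% random token / 10% unchanged replacement rule from the BERT paper. This is a simplified illustration: the function name mask_tokens, the toy vocabulary, and the choice to always mask at least one token are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Select ~mask_prob of the non-special tokens and corrupt them.

    Returns the corrupted token list and a dict mapping each selected
    position to its original token (the prediction target).
    """
    rng = rng or random.Random()
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, t in enumerate(tokens) if t not in special]

    # Assumption for this sketch: mask at least one token when possible.
    num_to_mask = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    chosen = sorted(rng.sample(candidates, num_to_mask))

    masked = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


toy_vocab = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
sequence = ["[CLS]", "my", "dog", "is", "cute", "[SEP]",
            "he", "likes", "play", "##ing", "[SEP]"]
masked, labels = mask_tokens(sequence, toy_vocab, rng=random.Random(0))
print(masked)
print(labels)
```

Note that the model is still asked to predict the original token at every selected position, including those left unchanged or replaced with a random token.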

Task #2: Next Sentence Prediction (NSP)

Many important downstream tasks require an understanding of the relationship between sentences, not just the relationship between words within a sentence. The second part of pre-training therefore involves a binary next sentence prediction task whose examples can be trivially generated from any monolingual corpus. They train on a collection of sentence pairs, some of which occur consecutively in a document while others do not, and the model learns to classify sentence B as IsNext or NotNext.
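Generating such pairs from a corpus might look roughly like this. It is a sketch under stated assumptions: make_nsp_examples is a hypothetical helper, documents are assumed to be pre-split into sentences, the 50/50 split between IsNext and NotNext follows the BERT paper, and drawing the negative sentence from a different document is a simplification.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples for NSP.

    documents: list of documents, each a list of sentence strings.
    """
    rng = rng or random.Random()
    examples = []
    for doc_idx, doc in enumerate(documents):
        # Other non-empty documents we can draw a negative sentence from.
        others = [j for j in range(len(documents))
                  if j != doc_idx and documents[j]]
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if not others or rng.random() < 0.5:
                # Positive example: B really follows A in the document.
                examples.append((sentence_a, doc[i + 1], "IsNext"))
            else:
                # Negative example: B is a random sentence from elsewhere.
                other_doc = documents[rng.choice(others)]
                examples.append((sentence_a, rng.choice(other_doc), "NotNext"))
    return examples


toy_docs = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
for example in make_nsp_examples(toy_docs, rng=random.Random(0)):
    print(example)
```

Because the labels come for free from sentence order, no manual annotation is needed.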

Pre-training data

BERT is pre-trained on both BooksCorpus (800M words) and English Wikipedia (2.5B words). A document-level corpus (rather than a shuffled sentence-level one) is required so that consecutive sentences can be extracted for NSP.