16 BERT Fine-Tuning

For tasks involving text pairs, a common pattern in prior work is to encode each text independently and then apply bidirectional cross-attention between them, as in Parikh et al. (2016) and Seo et al. (2017). BERT instead uses self-attention to unify these two stages: because sentences A and B are concatenated into a single input sequence, self-attention over that sequence effectively includes bidirectional cross-attention between the two sentences.
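To make the packed A/B input concrete, here is a minimal sketch in plain Python. It assumes pre-split word tokens (BERT itself uses WordPiece), and the helper names `pack_pair` and `cross_sentence_pairs` are hypothetical, chosen only for illustration. It shows how the two sentences share one sequence, and how many query/key position pairs in full self-attention span both segments.

```python
CLS, SEP = "[CLS]", "[SEP]"

def pack_pair(tokens_a, tokens_b):
    """Concatenate two token lists into one BERT-style input sequence."""
    tokens = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def cross_sentence_pairs(segment_ids):
    """Count (query, key) position pairs whose two positions lie in different segments."""
    n_a = segment_ids.count(0)
    n_b = segment_ids.count(1)
    # Full self-attention scores every position against every other,
    # so both the A-to-B and the B-to-A blocks are present.
    return 2 * n_a * n_b

tokens, segment_ids = pack_pair(["how", "are", "you"], ["fine", "thanks"])
print(tokens)
print(segment_ids)
print(cross_sentence_pairs(segment_ids))
```

Because every position attends to every other position in the packed sequence, the attention between A tokens and B tokens comes for free, with no separate cross-attention module.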

This makes BERT straightforward to adapt to many downstream tasks; the paper reports results on 11 NLP tasks. For each task, the authors plug in the task-specific inputs and outputs and fine-tune all parameters end-to-end. The sentence A and sentence B slots used in pre-training map naturally onto the inputs of many fine-tuning tasks, such as the two sentences in paraphrasing or the question and passage in question answering.
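The mapping from tasks to the A/B slots can be sketched as a small lookup. This is an illustration, not the paper's code: the function names, task names, and dictionary keys below are hypothetical. The pairings themselves, and the split between token-level and [CLS]-level outputs, follow the paper's description of fine-tuning.

```python
def to_text_pair(task, example):
    """Map a task-specific example (a dict) to a (text_a, text_b) pair."""
    if task == "paraphrase":
        return example["sentence1"], example["sentence2"]
    if task == "entailment":
        return example["hypothesis"], example["premise"]
    if task == "question_answering":
        return example["question"], example["passage"]
    if task in ("classification", "tagging"):
        # Degenerate pair: single-text tasks leave the second slot empty.
        return example["text"], None
    raise ValueError(f"unknown task: {task}")

def output_level(task):
    """Say which representation feeds the task-specific output layer."""
    # Tagging and question answering predict per token; the other tasks
    # classify the whole input from the [CLS] representation.
    if task in ("tagging", "question_answering"):
        return "token"
    return "cls"

pair = to_text_pair(
    "question_answering",
    {"question": "Who wrote it?", "passage": "It was written by Ada."},
)
print(pair, output_level("question_answering"))
```

Only this thin input and output layer changes from task to task; the pre-trained encoder in between is reused and updated end-to-end.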

Fine-tuning is inexpensive relative to pre-training. The paper states that all of its results can be replicated in at most one hour on a single Cloud TPU, or a few hours on a GPU, starting from the same pre-trained model.

The ease with which BERT can be fine-tuned for these tasks shows the power of general-purpose language representations and the versatility of this language model.