Prior techniques such as Parikh et al. (2016) and Seo et al. (2017) independently encode the two members of a text pair before applying bidirectional cross-attention between them. BERT departs from this pattern: by packing sentences A and B into a single input sequence, it lets the self-attention mechanism unify those two stages, since self-attention over the concatenated pair effectively includes bidirectional cross-attention between the two sentences.
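The packed input can be sketched in plain Python. This is illustrative only: real BERT inputs use WordPiece tokenization and vocabulary ids, which are omitted here, and the example tokens are made up.

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two sentences into BERT's single input sequence.

    The model sees [CLS] A [SEP] B [SEP] as one sequence, so
    self-attention attends across both sentences at once and the
    cross-attention between A and B is implicit.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment (token-type) ids distinguish sentence A (0) from sentence B (1).
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = pack_pair(["the", "cat"], ["it", "sat"])
# tokens -> ['[CLS]', 'the', 'cat', '[SEP]', 'it', 'sat', '[SEP]']
# segs   -> [0, 0, 0, 0, 1, 1, 1]
```

Because both sentences live in one sequence, no separate cross-attention module is needed; every layer of self-attention already mixes information between A and B.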
This makes BERT well equipped to handle a wide range of NLP tasks; the paper reports results on eleven. For each task, the authors plug in the task-specific inputs and outputs and fine-tune all parameters end-to-end. The sentence A/sentence B structure used in pre-training maps directly onto many fine-tuning tasks, such as paraphrasing (sentence pairs) and question answering (question/passage pairs).
Fine-tuning is cheap relative to pre-training: the paper reports that all of its results can be reproduced in at most an hour on a single Cloud TPU, or a few hours on a GPU.
That these fine-tuning tasks are so easy to train demonstrates the power of general-purpose language representations and the versatility of this language model.