BERT, simply put, is a language representation model based on the Transformer's self-attention mechanism, pre-trained bidirectionally. Its main departure from earlier pre-training approaches such as GPT is the move to bidirectional pre-training instead of the unidirectional (left-to-right) conditioning those models rely on.
Language model pretraining has been shown to be effective for improving many language tasks. There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.
- The feature-based approach is employed by ELMo, which uses task-specific architectures that incorporate the pre-trained representations as additional input features.
- The fine-tuning approach is employed by GPT-1: it introduces minimal task-specific parameters and simply fine-tunes all pre-trained weights on each downstream task.
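The distinction between the two strategies can be sketched in a toy setup (all names and values here are hypothetical stand-ins, not the actual ELMo or GPT code; real implementations would use a deep learning framework):

```python
# Hypothetical toy illustration: the same pretrained weights used two ways.
pretrained = {"embed": [0.1, 0.2], "encoder": [0.3, 0.4]}

# Feature-based (ELMo-style): pretrained weights stay frozen and act as
# extra input features; only the task-specific head is trainable.
feature_based_trainable = {"task_head": [0.0, 0.0]}

# Fine-tuning (GPT-style): every weight, pretrained and new, is updated.
fine_tuning_trainable = {**pretrained, "task_head": [0.0, 0.0]}

print(sorted(feature_based_trainable))  # only the new head is trained
print(sorted(fine_tuning_trainable))    # all parameters are trained
```

The practical trade-off: the feature-based route lets one expensive pre-trained model serve many cheap task heads, while fine-tuning adapts the whole network to each task.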
The paper argues that standard unidirectional pre-training restricts the power of the pre-trained representations. A left-to-right language model attends only to past context and cannot see words to the right of the current position, which is suboptimal for sentence-level tasks and especially harmful for token-level tasks such as question answering, where context from both directions matters.
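The unidirectional/bidirectional distinction boils down to the attention mask. A minimal NumPy sketch (sequence length and names are illustrative, not from the paper):

```python
import numpy as np

seq_len = 4  # toy sequence of 4 tokens

# Bidirectional (BERT-style): every position may attend to every position,
# so representations mix left and right context.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Unidirectional (left-to-right, GPT-style): position i may attend only to
# positions j <= i, so tokens to the right are invisible.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask.astype(int))
# Each row i has ones only up to column i: future tokens are masked out.
```

In practice these masks are applied inside self-attention by setting masked positions to a large negative value before the softmax, so they receive zero attention weight.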