## 16 BERT Experiments

### GLUE

The General Language Understanding Evaluation (GLUE) benchmark is a collection of natural language understanding tasks. BERT is fine-tuned on GLUE by adding only an additional classification layer with weights $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels and $H$ is the hidden size. A standard classification loss is computed from $W$ and $C$, the final hidden vector of the [CLS] token, i.e. $\log(\mathrm{softmax}(C W^T))$. The authors found that fine-tuning was sometimes unstable on small datasets, so they mitigated the problem with random restarts and manual selection of the best models.
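The classification head above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the sizes and the random values standing in for the [CLS] vector $C$ and the weights $W$ are assumptions.

```python
import numpy as np

def glue_label_probs(C, W):
    """Label probabilities from the [CLS] representation.

    C: (H,) final hidden vector of the [CLS] token.
    W: (K, H) classification layer weights, K = number of labels.
    Returns softmax(C W^T) as a (K,) probability vector.
    """
    logits = W @ C                        # (K,) one logit per label
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()

# Toy sizes for illustration; BERT-base actually uses H = 768.
rng = np.random.default_rng(0)
H, K = 8, 3
C = rng.normal(size=H)        # hypothetical [CLS] vector
W = rng.normal(size=(K, H))   # hypothetical classification weights
probs = glue_label_probs(C, W)
```

During fine-tuning, the cross-entropy of these probabilities against the gold label is minimized, updating both $W$ and all BERT parameters.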

Results are shown in Table 1. BERT outperforms all previous systems on all tasks by a significant margin.

### SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k question/answer pairs. The task is to predict the answer span within a given passage. BERT is augmented with only a single additional start vector $S$ and end vector $E$; the score of a candidate span from token $i$ to token $j$ is $S \cdot T_i + E \cdot T_j$, where $T_i$ is the final hidden vector of token $i$. Table 2 shows the leaderboard results at the time of the paper's writing.
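The span-selection rule can be sketched as an exhaustive search over start/end pairs with $j \ge i$. The token vectors and toy dimensionality here are assumptions for illustration; in practice the search is done efficiently over logits rather than with an explicit double loop.

```python
import numpy as np

def best_answer_span(T, S, E):
    """Pick the highest-scoring answer span in a passage.

    T: (n, H) final hidden vectors for the n passage tokens.
    S, E: (H,) start and end vectors learned during fine-tuning.
    Score of span (i, j) is S.T_i + E.T_j, restricted to j >= i.
    Returns ((i, j), score) for the best-scoring span.
    """
    start_logits = T @ S   # (n,) start score for each token
    end_logits = T @ E     # (n,) end score for each token
    best_score, best_span = -np.inf, (0, 0)
    n = len(T)
    for i in range(n):
        for j in range(i, n):          # only valid spans: end >= start
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span, best_score

# Tiny deterministic example (H = 2, n = 3); values are made up.
T = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
S = np.array([1.0, 0.0])
E = np.array([0.0, 1.0])
span, score = best_answer_span(T, S, E)  # -> span (0, 1), score 2.0
```

Training maximizes the log-likelihood of the correct start and end positions, which is equivalent to a softmax over the start and end logits computed above.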

SQuAD v2.0 extends the task by allowing questions to have no answer in the passage. To account for this, BERT treats questions that do not have an answer as having an answer span that starts and ends at the [CLS] token. The model achieved a +5.1 F1 improvement over the previous best system.
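The no-answer decision can be sketched by comparing the best non-null span score against the score of the null span at [CLS], with a threshold tuned on dev data. The vectors, sizes, and threshold value below are illustrative assumptions.

```python
import numpy as np

def predict_span_or_null(C, T, S, E, tau=0.0):
    """SQuAD v2.0-style prediction: answer span or no answer.

    C: (H,) final hidden vector of the [CLS] token.
    T: (n, H) final hidden vectors of the passage tokens.
    S, E: (H,) start and end vectors.
    The null (no-answer) score is S.C + E.C; a span is predicted
    only if its score beats the null score by the margin tau,
    where tau is a threshold selected on the dev set.
    Returns (i, j) for an answer span, or None for no answer.
    """
    start_logits = T @ S
    end_logits = T @ E
    n = len(T)
    best_score, best_span = -np.inf, (0, 0)
    for i in range(n):
        for j in range(i, n):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    null_score = S @ C + E @ C   # span starting and ending at [CLS]
    return best_span if best_score > null_score + tau else None

# Toy example (H = 2): the null score is 0, the best span scores 4.
C = np.array([0.0, 0.0])
T = np.array([[1.0, 0.0], [2.0, 0.0]])
S = np.array([1.0, 0.0])
E = np.array([1.0, 0.0])
answer = predict_span_or_null(C, T, S, E)            # -> (1, 1)
no_answer = predict_span_or_null(C, T, S, E, tau=10) # -> None
```

Raising `tau` makes the model more conservative, abstaining more often; the paper's dev-set threshold tuning trades off F1 on answerable versus unanswerable questions.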