Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
In the BERTOLOGY series, I’m trying to make my way through all the BERTology papers on Huggingface.co. This paper looks to be about transferring pre-trained checkpoints and adapting them to sequence generation tasks.
1. What’s new in the paper?
2. Do you have a clear overview of what the paper is all about?
Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing.
The opening sentence does a good job of setting the context: NLP, transformed by large-scale pre-training.
In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both encoder and decoder, with these checkpoints.
Since this paper doesn’t have a catchy title, it is a bit hard to tell whether it is proposing a model or a pre-training technique.
Our models result in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.
Always a good sign.
3. Look at the images and extract a set of “questions” about what the images leave unclear about their method. Now your job is to answer these questions by reading the paper.
This paper has no images. They must have been too busy with text generating the paper to get to pictures. Oh well.
4. Read the method aiming to answer your “questions” about the paper. Focus on understanding only the things relevant for the story (i.e., to understand the contribution).
We aim to provide an empirical answer to the following research question: what is the best way to leverage publicly available pre-trained checkpoints for warm-starting sequence generation models?
…“the main contribution of this paper is to rigorously experiment with a large number of different settings to combine BERT, GPT-2, and RoBERTa pre-trained checkpoints to initialize our Transformer-based model.”
Overall, we have run over 300 experiments spending thousands of TPU v3 hours to better accommodate the language modeling and understanding capabilities of these pre-trained models for text generation.
This is another way of saying there are a lot of details, which may or may not be hard to decipher.
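Concretely, the warm-starting the paper studies amounts to copying checkpoint tensors into a fresh seq2seq model wherever they fit, and leaving the rest (e.g. the decoder’s cross-attention) randomly initialized. A minimal PyTorch sketch of the idea, not the paper’s code; the toy sizes and the `warm_start` helper are my own illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pre-trained BERT-style encoder checkpoint (toy sizes, illustrative only).
pretrained = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4), num_layers=2
)

# Fresh seq2seq model whose encoder has the same architecture.
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2, num_decoder_layers=2)

def warm_start(target: nn.Module, checkpoint: nn.Module) -> list:
    """Copy checkpoint tensors into `target` wherever name and shape match;
    return the names of tensors left at their random initialization."""
    ckpt = checkpoint.state_dict()
    left_random = []
    with torch.no_grad():
        for name, tensor in target.state_dict().items():
            if name in ckpt and ckpt[name].shape == tensor.shape:
                tensor.copy_(ckpt[name])
            else:
                left_random.append(name)
    return left_random

# Warm-start the encoder; the decoder could be treated the same way, with its
# cross-attention weights necessarily staying randomly initialized.
leftover = warm_start(model.encoder, pretrained)
```

The same copy-where-shapes-match logic extends to the decoder, which is how a BERT-style checkpoint can initialize both sides of the model.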
5. Read the experiments to convince yourself that the results shown are caused by their claim. Be aware that the highlighted experiments are the best-case scenarios and are fully hyper-parameter tuned.
There are a lot of these, most of which are hard to summarize. There are a few conclusions they state at the end which I will repeat.
6. Make sure you answered all your questions. Did the authors convince you that their story has the effect that they claim?
They didn’t have a strong claim, and mostly seemed to present their experimental data. I don’t doubt that they ran good experiments, but I struggle to synthesize them into actionable items. From their conclusion, they make a few points:
- Combining BERT and GPT-2 underperforms
- Combining RoBERTa and GPT-2 achieves strong results
- Sharing weights between encoder and decoder is often a good memory trade-off
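The weight-sharing point can be sketched in the same spirit: where encoder and decoder sub-modules have identically shaped parameters (self-attention, feed-forward, layer norms), the decoder can simply reuse the encoder’s Parameter objects, while decoder-specific parts such as cross-attention stay separate. A hypothetical PyTorch illustration, not the paper’s implementation:

```python
import torch.nn as nn

def tie_matching_params(encoder: nn.Module, decoder: nn.Module) -> int:
    """Make the decoder reuse the encoder's Parameter objects wherever the
    dotted name and shape match; return how many tensors were tied."""
    enc_params = dict(encoder.named_parameters())
    tied = 0
    for name, param in list(decoder.named_parameters()):
        if name in enc_params and enc_params[name].shape == param.shape:
            # Walk to the owning submodule and swap in the encoder's Parameter.
            module = decoder
            *path, leaf = name.split(".")
            for part in path:
                module = getattr(module, part)
            setattr(module, leaf, enc_params[name])
            tied += 1
    return tied

# Toy model: self-attention, feed-forward and norm weights get shared;
# the decoder's cross-attention has no encoder counterpart and stays its own.
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2, num_decoder_layers=2)
n_tied = tie_matching_params(model.encoder, model.decoder)
```

After tying, those tensors are stored once, which is where the memory saving comes from.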