This paper introduces several techniques that improve on SoTA results. It’s impressive to see so many new proposals in a single paper. A few of them are listed here:
- Slanted triangular learning rates
- Concat pooling
- Gradual unfreezing
- BPTT for Text Classification
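The first of these, slanted triangular learning rates, can be sketched directly from the schedule the paper describes: a short linear warm-up followed by a long linear decay. This is a minimal stand-alone sketch (the function name `stlr` and default hyperparameter values are mine, chosen to mirror the paper's suggested defaults):

```python
def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t of T total iterations.

    cut_frac: fraction of iterations spent increasing the LR.
    ratio: how much smaller the lowest LR is relative to lr_max.
    """
    cut = int(T * cut_frac)  # iteration at which the LR peaks
    if t < cut:
        p = t / cut  # short linear increase toward the peak
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

The LR starts at `lr_max / ratio`, peaks at `lr_max` after `cut_frac` of training, then decays back down, so the model quickly converges toward a suitable region of parameter space and then refines its parameters slowly.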
I want to focus on gradual unfreezing, but the essence of the paper is the fundamental idea of transferring knowledge from a pre-trained language model into a specialized learner. By doing this, you leverage the general knowledge the language model has already captured, and save incredible amounts of training time and labeled data as well.
Section 4.2 (Results) quotes a two-order-of-magnitude reduction in required labeled data for statistically similar results on the TREC-6 dataset.
To achieve good transfer of knowledge to the target task, the fine-tuning step is critical. Fine-tuning too aggressively causes catastrophic forgetting (erasing the general knowledge captured during language-model pretraining), while fine-tuning too cautiously leads to slow convergence and overfitting.
Thus the proposed solution is to unfreeze gradually: start by training only the layers holding the most task-specific knowledge (the last layers), then progressively unfreeze earlier layers until you are also training the layers holding the most general knowledge (the “edge” and “corner” detectors, in vision terms). Combined with slanted triangular learning rates, gradual unfreezing enables diverse specialization from a single language model.
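The unfreezing schedule above can be sketched as a small stand-alone loop. This is an illustrative toy, not the paper's code: the `Layer` class, layer names, and the one-layer-per-epoch policy are my assumptions, standing in for a real framework's parameter groups.

```python
class Layer:
    """Toy stand-in for a model layer with trainable parameters."""
    def __init__(self, name):
        self.name = name
        self.trainable = False  # the whole model starts frozen

def gradual_unfreeze(layers, epoch):
    """Before training epoch `epoch`, unfreeze the last `epoch + 1` layers.

    Epoch 0 trains only the task-specific head; each subsequent epoch
    unfreezes one more layer, moving from specific toward general layers.
    """
    start = max(0, len(layers) - (epoch + 1))
    for layer in layers[start:]:
        layer.trainable = True

# Hypothetical ULMFiT-style layer stack, last layer = classifier head.
layers = [Layer("embedding"), Layer("lstm_1"), Layer("lstm_2"), Layer("classifier")]
gradual_unfreeze(layers, 0)  # epoch 0: only the classifier head trains
gradual_unfreeze(layers, 1)  # epoch 1: head + last LSTM layer train
```

Each call would be followed by one epoch of fine-tuning; once every layer is unfrozen, training continues on the full model until convergence.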