A key benefit of transfer learning is low-shot learning: it lets you train models of similar or better quality with one to two orders of magnitude (50x - 100x) fewer labeled examples. That is amazing!
On IMDb and AG News, the paper reports that ULMFiT with only 100 labeled examples matches the performance of training from scratch with 10x and 20x more data, respectively, demonstrating the benefit of general-domain LM pretraining.
In every case, pretraining improves performance and reduces data requirements, regardless of target dataset size.
The quality of the language model matters, but even a vanilla LM still provides benefits.
Classifier fine-tuning improves results by a factor of 2+ on IMDb; the impact on TREC-6 is smaller, but still significant. See table below.
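Part of the gain from classifier fine-tuning comes from ULMFiT's discriminative fine-tuning, where each layer gets its own learning rate and the paper sets the rate of the layer below as the rate above divided by 2.6. A minimal sketch of that schedule (the function name and interface here are hypothetical; only the 2.6 factor comes from the paper):

```python
# Hypothetical sketch (not the paper's code): ULMFiT's discriminative
# fine-tuning gives each layer its own learning rate, with the rate of
# each lower layer equal to the layer above divided by 2.6, so layers
# closer to the input change more slowly during fine-tuning.
def discriminative_lrs(base_lr, n_layers, factor=2.6):
    """Return per-layer learning rates, lowest (input) layer first."""
    lrs = [base_lr]
    for _ in range(n_layers - 1):
        lrs.append(lrs[-1] / factor)  # each lower layer gets a smaller rate
    return list(reversed(lrs))

# e.g. with a base rate of 0.01 and 3 layers, the top layer trains at 0.01
# while the bottom layer trains at roughly 0.0015
print(discriminative_lrs(0.01, 3))
```

These per-layer rates would then be passed as separate parameter groups to the optimizer, which is what lets the top classifier layers adapt quickly while the pretrained lower layers are only gently nudged.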