Pre-trained language models
The authors of "A Survey of Large Language Models" distinguish "pre-trained language models" (PLMs) from large language models (LLMs).
ELMo and BERT would fall into the first category, somewhat like pioneers of LLMs.
As an early attempt, ELMo was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks.
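To make the ELMo idea concrete, here is a minimal PyTorch sketch of a bidirectional LSTM producing context-dependent token representations instead of fixed word vectors. The class name, vocabulary size, and dimensions are illustrative assumptions, not ELMo's actual configuration (which also uses a character-level encoder and a language-modeling objective).

```python
import torch
import torch.nn as nn

# Illustrative sketch: a biLSTM turns token embeddings into representations
# that depend on both left and right context (the core idea behind ELMo).
class ContextualEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, seq_len, 2 * hidden_dim)
        embedded = self.embedding(token_ids)
        contextual, _ = self.bilstm(embedded)
        return contextual  # one context-aware vector per token position


encoder = ContextualEncoder()
tokens = torch.randint(0, 10000, (1, 6))  # one dummy sentence of 6 token ids
representations = encoder(tokens)
print(representations.shape)              # torch.Size([1, 6, 512])
```

In ELMo's setting, such a network is first pre-trained on unlabeled text and then fine-tuned (or its representations reused) for each downstream task.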
Further, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora.
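A quick way to see what BERT's bidirectional pre-training buys you is to extract its contextual representations from a pre-trained checkpoint. The sketch below uses the Hugging Face transformers library; the checkpoint name and the example sentence are just illustrative choices, not anything prescribed by the survey.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained BERT checkpoint (trained on large-scale unlabeled corpora).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token, each conditioned on the full bidirectional context
# via the Transformer's self-attention layers.
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```

The same pre-trained weights are then fine-tuned on labeled data for a specific downstream task, which is the general PLM recipe the survey describes.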
The difference between PLMs and LLMs lies in the scaling up of data and model size, which significantly improves performance on downstream tasks.
Researchers find that scaling PLM (e.g., scaling model size or data size) often leads to an improved model capacity on downstream tasks.