A general-purpose training procedure that can be applied to a variety of NLP tasks in a zero-shot manner.
Key Insights
- This paper presents a general-purpose training procedure that can be applied to a variety of NLP tasks, using the task instruction and task input as conditioning factors.
- A model trained on a massive, diverse corpus with an unsupervised objective can handle many tasks in a zero-shot manner, matching or exceeding SOTA on several language modeling benchmarks while still trailing supervised systems on harder tasks.
Problems of Previous Methods
- Heavily supervised methods require large amounts of labeled data; however, some tasks lack high-quality labeled data or a sufficient volume of it.
- Previously, different training objectives needed to be formulated for each task.
- As training was specific to a certain task and narrow dataset, the model had limited generalization and transfer capabilities.
Advantages of This Method
- The model utilizes self-supervised learning, eliminating the need for labeled data.
- The training procedure and objective are versatile, applicable to both pre-training and downstream tasks.
- It can handle many tasks in a zero-shot approach.
- No changes to the model’s architecture are required.
Approach/Core Idea
- This paper treats language modeling as a conditional generation problem. It takes advantage of language’s inherent sequential order: subsequent tokens are conditioned on previous ones. The goal is to maximize the sequence’s probability.
- $P(x) = \prod_{i=1}^{n} P(x_i \mid x_{<i})$
- Text generation is conditioned on the task instruction and the task input, both presented as plain text (a minimal sketch follows this list).
- The training corpus must be diverse enough to encompass language distribution and large enough to provide sufficient training data.
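A minimal sketch of the factorization and of task conditioning as plain text, using a toy uniform next-token distribution as a stand-in for a trained model (the vocabulary, the `toy_next_token_probs` helper, and the prompt wording are illustrative assumptions, not the paper's code):

```python
import math

VOCAB = ["Translate", "English", "to", "French:", "cheese", "=>", "fromage"]

def toy_next_token_probs(prefix):
    # Stand-in for a trained language model: a uniform distribution over VOCAB.
    # A real model would condition on `prefix` and return sharper probabilities.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def completion_log_prob(prompt_tokens, completion_tokens):
    # log P(completion | prompt) = sum_i log P(x_i | prompt, x_<i):
    # the autoregressive factorization, restricted to the completion tokens.
    context = list(prompt_tokens)
    log_p = 0.0
    for tok in completion_tokens:
        log_p += math.log(toy_next_token_probs(context)[tok])
        context.append(tok)
    return log_p

# The task instruction and task input are plain text in the prompt; the "answer"
# is whatever continuation the model assigns high probability to.
prompt = "Translate English to French: cheese =>".split()
print(completion_log_prob(prompt, ["fromage"]))
```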
Method
- The necessary corpus characteristics include:
- Diversity: The corpus should not be limited to just dialogues, news, or novels.
- Logical coherence: The text should reflect the underlying rules of the language; raw Common Crawl data, for instance, contains too much noise before filtering.
- Volume and ease of collection: The corpus should be both extensive and easily collectable.
- WebText: This paper uses outbound links from Reddit posts that received at least 3 karma as the data source. Because these links are curated by users, they provide reasonably high-quality text at low collection cost.
- Wikipedia documents are removed from WebText to avoid overlap with the evaluation datasets used elsewhere.
Encoding Strategy: Byte Pair Encoding (BPE)
- What is BPE:
- To establish a vocabulary table for tokenization, the BPE algorithm treats text as a byte sequence and merges byte sub-sequences based on frequency.
- Advantages of BPE:
- It can represent any string, including unknown and invented words.
- It doesn’t require lossy preprocessing like casting all words to lowercase.
- This method strikes a balance between vocabulary size and word fragmentation. Tokenizing at the Unicode code-point level without merging makes the base vocabulary excessively large, given Unicode's vast range; using only the 256 byte values without merging them into larger vocabulary entries discards much of the semantics embedded in words and significantly increases learning difficulty.
- Problem with naive BPE:
- It merges too many variants of common words, such as "dog.", "dog!", and "dog?".
- This approach wastes vocabulary space and model capacity.
- The solution is to categorize characters into types, separating punctuation from words, for instance; merges are not allowed across categories, with an exception for spaces, which improves compression efficiency. A toy sketch of the basic merge loop follows this list.
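A toy character-level sketch of the merge loop, under the assumption of a tiny word-frequency corpus; GPT-2 actually works on bytes and adds the category restrictions described above, which are omitted here:

```python
from collections import Counter

def get_pair_counts(corpus):
    # corpus maps a tuple of symbols (a tokenized word) to its frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen adjacent pair with one merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is split into characters and weighted by frequency.
corpus = {tuple("dog"): 10, tuple("dogs"): 5, tuple("dig"): 3}
merges = []
for _ in range(5):  # learn five merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)
print(merges)  # each merge becomes a new vocabulary entry
```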
Model
- A decoder-only transformer serves as the backbone.
- Pre-norm residual blocks are employed.
- The weights of the residual layers are scaled at initialization by 1/sqrt(N), where N is the number of residual layers (see the sketch after this list).
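A minimal PyTorch sketch of a pre-norm block with the scaled residual initialization; the module layout and dimensions are illustrative assumptions rather than the exact GPT-2 implementation:

```python
import math
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: LayerNorm is applied before attention and the MLP."""

    def __init__(self, d_model, n_heads, n_layers):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scale the weights of the layers that feed the residual stream by 1/sqrt(N),
        # where N is the number of residual layers, to keep its variance stable.
        for proj in (self.attn.out_proj, self.mlp[2]):
            proj.weight.data.mul_(1.0 / math.sqrt(n_layers))

    def forward(self, x, causal_mask):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

# Toy usage: a boolean mask where True marks the (future) positions to hide.
block = PreNormBlock(d_model=64, n_heads=4, n_layers=12)
x = torch.randn(2, 10, 64)
mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
print(block(x, mask).shape)  # torch.Size([2, 10, 64])
```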
Training Process and Loss Function
- The learning rate is manually tuned for the best perplexity on a 5% held-out sample of WebText.
- Four models are trained with a geometrically increasing number of parameters.
- Sequence length (seqlen) is set to 1024.
- Batch size is set to 512.
- Loss function: $L = \frac{1}{n_{\text{seq}}} \sum_{i=1}^{n_{\text{seq}}} \mathrm{CrossEntropy}(p_i, x_i)$
- Here, $p_i$ is the predicted distribution over the vocabulary at position $i$, and $x_i$ is the ground-truth token.
- From the first token to the last, the loss is the average cross entropy between the predicted distribution and the ground-truth token (a sketch follows this list).
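A sketch of this loss under the assumption that the model returns logits of shape [batch, seqlen, vocab]; random logits stand in for a real forward pass:

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, tokens):
    # logits: [batch, seqlen, vocab] -- the model's prediction at every position.
    # tokens: [batch, seqlen]        -- the input sequence itself.
    # Position i predicts token i+1, so shift logits and targets by one step.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    # Average cross entropy between the predicted distribution and the true token.
    return F.cross_entropy(pred, target)

# Toy usage with random logits standing in for the model's output.
batch, seqlen, vocab = 2, 8, 50257
logits = torch.randn(batch, seqlen, vocab)
tokens = torch.randint(0, vocab, (batch, seqlen))
print(lm_loss(logits, tokens))  # roughly ln(50257) ≈ 10.8 for random logits
```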
Metrics and Evaluation
- Perplexity (PPL): Lower values are better. With $L$ the average cross-entropy per token in nats, $\mathrm{PPL} = \exp(L)$.
- Bits per byte (BPB): This measures the model's compression efficiency. Lower values are better. With $L$ the average cross-entropy per byte in nats, $\mathrm{BPB} = L / \ln 2$ (see the conversion sketch after this list).
- Bits per character (BPC): Lower values are preferable.
- Accuracy (ACC): Higher values are better. This metric is used for multiple-choice tasks.
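Assuming $L$ is the average cross-entropy in nats (per token for PPL, per byte for BPB, per character for BPC), the conversions are straightforward:

```python
import math

def perplexity(loss_nats_per_token):
    # PPL = exp(L): the effective branching factor of the model's predictions.
    return math.exp(loss_nats_per_token)

def bits_per_byte(loss_nats_per_byte):
    # Convert nats to bits: BPB = L / ln 2.
    return loss_nats_per_byte / math.log(2)

def bits_per_char(loss_nats_per_char):
    # Same nats-to-bits conversion, measured per character instead of per byte.
    return loss_nats_per_char / math.log(2)

print(perplexity(3.0), bits_per_byte(0.9), bits_per_char(0.7))
```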
Results and Findings
- The model can handle many tasks across different domains in a zero-shot manner, demonstrating considerable improvement on tasks with small datasets.
- The model performs reliably on noisy data and out-of-distribution data.
Children’s Book Test
- This dataset tests the model's ability to pick the missing word from a set of candidates, given 20 sentences of context for each question; the model must capture the language's long-range dependencies (a scoring sketch follows this section).
- Even the smallest model surpasses the state-of-the-art (SOTA).
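One way to score such cloze questions with a language model is to substitute each candidate into the blank and keep the one that makes the whole passage most probable; a minimal sketch under that assumption, where `sentence_log_prob` is a hypothetical helper wrapping any autoregressive LM:

```python
def pick_candidate(context, sentence_with_blank, candidates, sentence_log_prob):
    # sentence_log_prob(text) -> log P(text) under the language model
    # (hypothetical helper; any function that scores text with an LM works here).
    scores = {}
    for cand in candidates:
        filled = sentence_with_blank.replace("____", cand)
        scores[cand] = sentence_log_prob(context + " " + filled)
    # The candidate that makes the full passage most probable wins.
    return max(scores, key=scores.get)
```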
LAMBADA
- The task involves predicting the last word of a passage; humans typically need at least 50 tokens of context to predict it.
- The models outperform the SOTA in terms of perplexity (PPL).
- For incorrect predictions, most predictions are valid continuations of the sentence but not valid final words. This suggests that the models are not specialized for this task and do not know they should predict the sentence’s last word.
CoQA
- This task tests reading comprehension; the model needs to understand the conversation history because answers to questions such as "Why?" depend on it.
- Without fine-tuning, the model matches or exceeds three of the four supervised baselines, though it still falls well short of the BERT-based SOTA.
- Problem and finding: The model often uses simple retrieval-based heuristics, such as directly copying a name from the context to answer a “who” question.
Summarization
- The prompt “TL;DR:” is used to induce summarization.
- The performance falls short of classic neural baselines.
- Problem and finding:
- Capturing long-distance dependencies is challenging. The model tends to summarize only the most recent sentences.
- It makes mistakes on details, such as how many cars were involved in a crash.
Translation
- Although English predominates in the training corpus, the model shows a limited ability to understand and translate French, even though the French data in WebText is roughly 500 times smaller than the monolingual corpora typically used for unsupervised translation (a prompt-construction sketch follows this section).
- The overall performance significantly trails behind SOTA.
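The paper induces translation by conditioning on example pairs written as "english sentence = french sentence" followed by the source sentence and a trailing "="; a small sketch of building such a prompt (the example sentences are invented for illustration):

```python
def build_translation_prompt(example_pairs, source_sentence):
    # Each example pair is rendered as "english sentence = french sentence";
    # the final line ends with "=" so the model completes it with a translation.
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

examples = [("I like cheese.", "J'aime le fromage."),
            ("Where is the station?", "Où est la gare ?")]
print(build_translation_prompt(examples, "The weather is nice today."))
```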
Question Answering
- Factoid questions were used to assess and quantify how much knowledge has been learned by the model.
- Model size is crucial for performance.
- Compared with an open-domain question-answering system that uses a retrieval-based method, even the largest GPT-2 performs significantly worse.
Inspiration and Derivatives from This Paper
Distribution of Language
An effective language model should be capable of learning the inherent rules of language, as these rules determine the distribution of language. By evaluating what factors influence language and whether these factors can be learned easily, we can identify the model’s weaknesses and areas for improvement.
Language, an outcome of thought, conceptualizes human senses and cognition, serving as a medium for communication and a vehicle for thought. It can also be viewed as an agent’s reaction to its environment, functioning as an explicit symbol for conveying messages or abstracting entities to facilitate logical reasoning. However, since language compresses information and cannot fully represent its referent, it cannot fully reflect the process of thought but rather serves as an output of this process.
From a text-generation perspective, each subsequent token is conditioned on the preceding text. The relationship between the last token and the preceding text can be syntactic or semantic. Among these, semantic relationships are the most crucial and the most challenging to learn. Semantics originate from the agent’s cognitive activity and are influenced by various factors that may be hard to infer from preceding text. By examining these factors and tracking their influence on cognition, we can identify which elements are difficult for models to predict, thereby pinpointing the model’s weaknesses and potential areas for enhancement.
Whether such a factor is easy or hard to predict depends on the following characteristics:
- The strength of its association with the known conditions (the preceding text). When a high-probability answer can be inferred from prior context, the factor is easy to predict: if the given text is “The sky is…”, we can confidently predict “blue” as the next word. Personal experiences, by contrast, are unpredictable: given “When I was a child…”, predicting the next word is difficult because individual experiences are unique.
- Whether it is an unknown factor that is not directly related to the preceding text but indirectly influences future text through the agent’s state; such factors are hard to predict.
Factors that might influence current semantics include:
- Preceding text: This factor is easy for a well-trained model to predict as the meaning can typically be inferred from the context.
- Environment: This represents the agent’s physical and social surroundings, which might not be explicitly stated in the text.
  - Context: This can usually be inferred to a large degree from preceding text and can significantly influence subsequent text. However, it may not always be fully embodied in preceding text.
  - Agent’s environment: Very little about this can be inferred from preceding text. Although it might not directly impact preceding text, it indirectly influences future text by affecting the agent’s state.
- Memory:
  - Common knowledge and cultural background: These are easily predictable; given preceding context, we can confidently infer what common knowledge is being used. They directly impact future text in much the same way as preceding context.
  - Personal experiences: These are hard to predict unless explicitly mentioned in preceding text. Like preceding context, personal experiences directly influence future text.
  - Domain knowledge: This is relatively easy to predict by determining the field or domain of knowledge from the context and preceding text. The challenge lies with small-scale models, which lack the capacity to accommodate this low-frequency knowledge.
- Inner state: This refers to intermediate products of the cognitive process rather than the preceding text itself. The aim is not to describe the entire cognitive process but to view it as a black box with assumed intermediate outputs, for a better understanding of the model.
  - Perception and understanding of the environment: These are hard to infer, but more direct than environmental factors.
  - Thoughts and understanding of preceding text: These are relatively easier to infer, since text is an output of thought. The challenging part is predicting a complete thought or cognitive process; these significantly influence future text.
  - Motivation: This is relatively hard to predict. Although most people have similar motivations under similar circumstances, the environment is unpredictable, making motivations equally hard to predict. Motivations indirectly influence future text by affecting thoughts and emotions.
  - Emotions: These are usually reflected in the text and thus easier to infer; they directly and significantly influence future text.
This analysis indicates that even though large language models display impressive language comprehension abilities, there are still factors difficult to infer from preceding text. By providing additional information, we could potentially enhance the model’s understanding of language, particularly the underlying generation process, thereby improving its overall performance.
Why does this training procedure work?
A key advantage of this method is that the input distribution is the same for every token position: any prefix of a piece of text, used to predict the next token via $P(x_i \mid x_{<i})$, falls within the training distribution. Although training is conducted in parallel, attention masking ensures that the model only sees preceding tokens and never future ones, thereby aligning the distributions seen during training and inference (a minimal illustration of the mask follows).
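A minimal NumPy illustration of the causal mask that hides future tokens during parallel training (shapes and values are illustrative):

```python
import numpy as np

def causal_mask(seqlen):
    # mask[i, j] is True when position i may attend to position j,
    # i.e. only to itself and to earlier positions (j <= i).
    return np.tril(np.ones((seqlen, seqlen), dtype=bool))

def masked_attention_scores(scores):
    # Set disallowed (future) positions to -inf before the softmax so they get
    # zero attention weight; every prefix then behaves exactly as it would at
    # inference time, keeping the training and inference distributions aligned.
    seqlen = scores.shape[-1]
    return np.where(causal_mask(seqlen), scores, -np.inf)

print(masked_attention_scores(np.zeros((4, 4))))
```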
A sufficiently challenging objective is also crucial as it compels the model to learn the intrinsic rules influencing language distribution. If the training data covers a wide range, the model can’t merely overfit a subset of data through shortcuts; instead, it must predict all data by deciphering these inherent rules, thereby encompassing a vast body of knowledge. However, since the model can only indirectly learn about inner states, its understanding cannot fully represent cognitive states, resulting in limited logical and inferential capabilities. To render text more regular, the content should ideally encompass comprehensive information; for reference on this aspect, we can refer back to the analysis on language distribution. Despite its challenging nature, this objective is achievable by leveraging Transformers’ capability to capture long-distance dependencies, allowing model convergence.
Why can the model zero-shot many tasks?
Zero-shot refers to a model’s ability to perform reasonably well on a task without having been specifically trained for it. The objective of this model is to predict subsequent content based on preceding text, thereby creating a unified prediction format. This format can effectively guide the model’s behavior through prompts, harmonizing the way context and tasks are conditioned. Given sufficient conditions, the model can understand the task at hand. The model’s ability to not only comprehend tasks but also accomplish them stems from its understanding of the underlying rules of language. Essentially, it learns these deeper determinants of language distribution from extensive corpora; viewed from this deeper level, downstream tasks do not exceed the scope of the training distribution.
Why has this level of intelligence emerged in natural language modalities, while video-based auto-regressive models haven’t reached this level yet?
Language is a product of the thought process, serving as a vehicle for thought and a means of expressing dense semantics. Therefore, it is closer to human-understood intelligence compared to imagery. For video-based auto-regressive models, learning the laws of the world from videos and gaining an understanding of these global change patterns through a similar self-supervised approach is a more challenging task. Language inherently sets a context; even scattered text on the internet can construct context to some extent, although video can also achieve this.
Finally, from a hardware resource perspective, language has high information density; a small amount of text can convey substantial information. In contrast, imagery requires numerous pixels to express comparable information, making the data throughput requirements for language lower than visual modalities.