[Paper Note] Language Models are Unsupervised Multitask Learners
A general-purpose training procedure that can be applied to a variety of NLP tasks in a zero-shot manner.

Key Insights

Problems of Previous Methods

Advantages of This Method

Approach/Core Idea

Method

About Input

Encoding Strategy: Byte Pair Encoding (BPE)

Model

Training Process and Loss Function

Metrics and Evaluation

Results and Findings

Children’s Book Test

LAMBADA

CoQA

Summarization

Translation

Question Answering

Inspiration and Derivatives from This Paper

Distribution of Language

An effective language model should be capable of learning the inherent rules of language, as these rules determine the distribution of language. By evaluating what factors influence language and whether these factors can be learned easily, we can identify the model’s weaknesses and areas for improvement.

Language, an outcome of thought, conceptualizes human senses and cognition, serving as a medium for communication and a vehicle for thought. It can also be viewed as an agent’s reaction to its environment, functioning as an explicit symbol for conveying messages or abstracting entities to facilitate logical reasoning. However, since language compresses information and cannot fully represent its referent, it cannot fully reflect the process of thought but rather serves as an output of this process.

From a text-generation perspective, each token is conditioned on the preceding text. The relationship between the next token and the preceding text can be syntactic or semantic. Of these, semantic relationships are the most crucial and the hardest to learn. Semantics originate from the agent's cognitive activity and are shaped by factors that may be hard to infer from the preceding text alone. By examining these factors and tracing how they influence cognition, we can identify which elements are difficult for models to predict, thereby pinpointing the model's weaknesses and potential areas for improvement.
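For concreteness, this conditioning structure is the standard auto-regressive factorization that the paper builds on, where each token is predicted from all tokens before it:

$$
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})
$$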

Factors that are challenging to predict typically exhibit the following characteristics:

Factors that might influence current semantics include:

This analysis indicates that even though large language models display impressive language comprehension, there are still factors that are difficult to infer from the preceding text alone. By providing additional information, we could potentially enhance the model's understanding of language, particularly of the underlying generation process, and thereby improve its overall performance.

Why does this training procedure work?

A key advantage of this method is that the input distribution is the same at every token position: any prefix of a text paired with its next token, written $p(x_i \mid x_{<i})$, falls within the training distribution. Although training is conducted in parallel, attention masking ensures that the model only sees preceding tokens and never future ones, keeping the training and inference distributions aligned.
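As a concrete illustration of the masking described above, here is a minimal NumPy sketch of a causal attention mask; the function names and shapes are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean mask where position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Apply the causal mask to raw attention scores before the softmax.

    scores: (seq_len, seq_len) matrix of query-key dot products.
    Future positions are set to -inf so they receive zero attention weight.
    """
    seq_len = scores.shape[-1]
    masked = np.where(causal_mask(seq_len), scores, -np.inf)
    # Softmax over the key dimension.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = masked_attention_weights(rng.normal(size=(4, 4)))
    print(np.round(attn, 2))  # row i has zero weight on columns j > i
```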

A sufficiently challenging objective is also crucial, as it compels the model to learn the intrinsic rules that shape the distribution of language. If the training data covers a wide enough range, the model cannot simply overfit a subset of the data through shortcuts; it must predict all of the data by discovering these inherent rules, and in doing so it absorbs a vast body of knowledge. However, since the model can only learn about inner cognitive states indirectly, its understanding cannot fully represent them, which limits its logical and inferential capabilities. For the text to be more predictable, its content should ideally carry comprehensive information; the earlier analysis of the distribution of language is a useful reference here. Despite its difficulty, the objective remains learnable because Transformers can capture long-distance dependencies, allowing the model to converge.
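To make the objective concrete, here is a minimal sketch of next-token prediction as cross-entropy over shifted tokens, assuming a generic model that already maps a token sequence to per-position logits; the function and variable names are illustrative:

```python
import numpy as np

def next_token_loss(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Average cross-entropy for predicting token t+1 from tokens <= t.

    logits:    (seq_len, vocab_size) model outputs, one row per position.
    token_ids: (seq_len,) the token sequence the logits were computed on.
    """
    # Predictions at position t are scored against the token at t + 1.
    pred = logits[:-1]            # (seq_len - 1, vocab_size)
    target = token_ids[1:]        # (seq_len - 1,)

    # Log-softmax for numerical stability.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of each true next token, averaged.
    return float(-log_probs[np.arange(len(target)), target].mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, seq_len = 50257, 8     # 50257 is GPT-2's BPE vocabulary size
    logits = rng.normal(size=(seq_len, vocab))
    tokens = rng.integers(0, vocab, size=seq_len)
    print(next_token_loss(logits, tokens))  # ~ln(50257) for random logits
```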

Why can the model zero-shot many tasks?

Zero-shot refers to a model's ability to perform reasonably well on a task without having been specifically trained for it. The objective of this model is to predict subsequent content from preceding text, which creates a unified prediction format. This format lets prompts guide the model's behavior, unifying the way context and task are expressed as conditioning. Given sufficient conditioning, the model can infer the task at hand. Its ability not only to recognize tasks but also to accomplish them stems from its grasp of the underlying rules of language: it learns these deeper determinants of language distribution from an extensive corpus, and viewed at this deeper level, downstream tasks do not fall outside the training distribution.
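A rough sketch of how such prompting works in practice is given below. The "TL;DR:" trigger for summarization and the "english sentence = french sentence" and "A:" formats follow the schemes described in the paper, while the concrete strings are illustrative:

```python
# Zero-shot prompting: the task is specified entirely in text,
# and the model simply continues the sequence.

def summarization_prompt(article: str) -> str:
    # Summaries are induced by appending "TL;DR:" after the article.
    return f"{article}\nTL;DR:"

def translation_prompt(sentence: str) -> str:
    # Translation is framed as continuing an "english = french" pattern.
    return f"good morning = bonjour\n{sentence} ="

def qa_prompt(context: str, question: str) -> str:
    # Reading comprehension: condition on the passage and a question, then "A:".
    return f"{context}\nQ: {question}\nA:"
```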

Why has this level of intelligence emerged in natural language modalities, while video-based auto-regressive models haven’t reached this level yet?

Language is a product of the thought process, serving as a vehicle for thought and a means of expressing dense semantics; it is therefore closer to what humans recognize as intelligence than imagery is. For video-based auto-regressive models, learning the laws of the world from video and grasping such global patterns of change through a similar self-supervised approach is a harder task. Language also inherently establishes a context; even scattered text on the internet constructs context to some extent, although video can do this as well.

Finally, from a hardware-resource perspective, language has high information density: a small amount of text can convey substantial information, whereas imagery requires many pixels to express comparable information. The data throughput required for language is therefore lower than for visual modalities.