Preliminary experiment for LLM distillation and pretraining
This experiment verifies the effectiveness of several methods from recent papers.
Language Model on TinyStories
About this project
What is this project?
- This is a preliminary experiment on pretraining a language model and using distillation to accelerate training and improve performance. The experiment verifies the effectiveness of the methods described below.
- Dearth-tiny is a language model trained on TinyStories. It can write short stories with a child-level vocabulary; it is not intended for instruction-style QA.
What is the significance and purpose of this project?
- Obtain smaller models through distillation that can run on CPUs and mobile devices
- Compared with pruning methods like Sheared LLaMA, distillation allows a more flexible model structure
- Compared with training from scratch, distillation may improve quality and accelerate training
- Make the model deeper, with more layers, which improves quality to a certain extent
- Try to make the model handle extremely long sequences
What are the limitations of this project?
- Due to the limited training data, this model can only generate short stories using very simple words, which means most natural language is out of distribution for it.
- The TinyStories dataset serves as a sanity check to verify the model and the training process; the scope and knowledge of the data are very limited, so a large number of parameters is not needed to perform well.
The distillation and training process
About the model structure
- DeepNet: amplify the residual connection so that the gradients of layers close to the input do not become too small; this allows a deeper model structure, which improves quality to a certain extent.
- LM_infinite: keep an attention window over the first tokens so that later tokens can still attend to them with enough weight, preventing out-of-distribution problems on long sequences.
- Mistral: use a sliding attention window to handle long sequences (https://arxiv.org/pdf/2310.06825.pdf).
- Add rotary position embeddings (RoPE) to the query and key vectors. A sketch combining these components follows this list.
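The following is a minimal sketch, not the project's actual code, of the three structural pieces described above: a DeepNet-style scaled residual, a causal mask combining a Mistral-style sliding window with an LM_infinite-style front window, and RoPE applied to the query/key vectors. Function names, shapes, and constants are illustrative assumptions.

```python
import torch

def deepnet_residual(x, sublayer_out, num_layers):
    # DeepNet (decoder-only setting) scales the residual branch by alpha = (2N)^(1/4)
    # so that gradients reaching layers close to the input do not become too small.
    alpha = (2 * num_layers) ** 0.25
    return alpha * x + sublayer_out

def windowed_causal_mask(seq_len, window, num_front_tokens):
    # True = attention allowed. A token attends to itself and the previous `window` tokens
    # (Mistral-style sliding window) plus the first `num_front_tokens` tokens
    # (LM_infinite-style front window), and never to future tokens.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i
    in_window = (i - j) < window
    front = j < num_front_tokens
    return causal & (in_window | front)

def apply_rope(q, k, base=10000.0):
    # Rotary position embedding on query/key tensors of shape (batch, heads, seq, head_dim).
    _, _, seq_len, head_dim = q.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()

    def rotate(x):
        x1, x2 = x[..., 0::2], x[..., 1::2]
        return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

    return rotate(q), rotate(k)

# Example: a 256-token mask with a 128-token sliding window and 4 front tokens kept visible.
mask = windowed_causal_mask(256, 128, 4)
q = torch.randn(1, 8, 256, 64)
k = torch.randn(1, 8, 256, 64)
q_rot, k_rot = apply_rope(q, k)
```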
How to distill and train the model
- One training objective is to imitate the teacher's attention maps and value vectors; the other is to imitate its logits.
- Use MSE loss for the soft labels, comparing teacher and student logits directly, because according to the profiler the softmax required by KL divergence is very time-consuming (see the loss sketch after this list).
- Sequence length = 256, batch_size for distillation = 200, batch_size for training = 800
- 2k steps for distillation with a 300-step warmup, then 2k steps for training.
- Use the Sophia optimizer with peak lr = 5e-4, beta1 = 0.9, beta2 = 0.99, weight decay = 0.2.
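Below is a minimal sketch, assuming a standard PyTorch setup, of how the two distillation objectives above (loss_mimic on attention maps/values and loss_soft on logits) could be combined with the hard next-token loss. The function name, weighting coefficients, and tensor shapes are assumptions for illustration, not the project's exact code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,   # (batch, seq, vocab)
                      student_attn, teacher_attn,       # lists of (batch, heads, seq, seq)
                      student_values, teacher_values,   # lists of (batch, heads, seq, head_dim)
                      next_tokens,                      # (batch, seq) next-token targets
                      soft_w=1.0, mimic_w=1.0):
    # Hard-label loss: next-token cross-entropy over every position.
    loss_hard = F.cross_entropy(student_logits.flatten(0, 1), next_tokens.flatten())

    # Soft-label loss: MSE directly on the logits, avoiding the softmax required by KL divergence.
    loss_soft = F.mse_loss(student_logits, teacher_logits)

    # Mimic loss: match the teacher's internal attention maps and value vectors.
    loss_mimic = sum(F.mse_loss(s, t) for s, t in zip(student_attn, teacher_attn))
    loss_mimic = loss_mimic + sum(F.mse_loss(s, t) for s, t in zip(student_values, teacher_values))

    # In the later training-only phase (after the 2k distillation steps), only loss_hard is kept.
    return loss_hard + soft_w * loss_soft + mimic_w * loss_mimic
```

Using MSE on raw logits avoids the softmax over the full vocabulary that KL divergence would need, which matches the profiling observation above.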
What error has been corrected?
- In the previous project on a low-rank LM, the loss function was wrong: it only checked the logits of the last token, which made training very slow and unstable. Checking only the last token's logits is unreasonable, because the model is trained to predict the next token at every position, not just at the last one, so every output position can be used to estimate the loss. A corrected sketch follows.
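A minimal sketch of the corrected objective, assuming a standard decoder-only setup where logits have shape (batch, seq, vocab) and the output at position i predicts token i + 1; the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    # Previously only logits[:, -1, :] contributed to the loss; here every position does.
    # Shift so that the logits at position i are scored against the token at position i + 1,
    # then average the cross-entropy over all positions and batch elements.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))  # (batch * (seq-1), vocab)
    shift_labels = tokens[:, 1:].reshape(-1)                       # (batch * (seq-1),)
    return F.cross_entropy(shift_logits, shift_labels)

# Example with random data: batch of 2 sequences of length 8, vocabulary of 100 tokens.
loss = next_token_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)))
```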

Figure 1: each loss component. After 2000 steps, loss_soft and loss_mimic are no longer used in training.

Figure 2: learning rate schedule for distillation and training.
Work in progress
What can be done to improve the results? Why are the results not good yet?
- The current PPL of the student model is 1.7, and the teacher's PPL is 0.9.
- More training steps are needed.
- Need to experiment with a more suitable learning rate for the new optimizer.
Potential problems
- Compared with the teacher model, the student model has fewer parameters. If the student spends too much capacity on fitting the teacher's internal structure, it may struggle to fit the hard-label loss (the next-token loss).
- Can the distillation process, i.e., learning the internal structure, replace exposure to a large number of tokens? In other words, does the ability of a large language model come from acquiring the attention maps, or from subtle internal representations that can only be learned from diverse and numerous training data?
What is the work in progress?
- Allow the learning rate to be adjusted manually during training without interrupting the run.
- Distill a 7B model into a 1B model, and then compare it with other 1B models.
- Instruction fine-tuning.
What is the future work?
- PPO for alignment.
- Memory-augmented model: add external memory to improve the model. Memory stored in a database can make the model more controllable and may reduce the hallucinations caused by memory stored in the parameters.