[Paper Note] The Super Weight in Large Language Models

https://arxiv.org/abs/2411.07191

Existing Methods

There are two main categories of quantization methods:

Weight-only Quantization Techniques for Mitigating Outlier Effects

Weight-Activation Quantization

Relationship Between Weight Outliers and Activation Outliers

Identifying Super Weights

In many layers of a language model, a particular output (activation) dimension consistently shows a very large magnitude at the same position, no matter what the input is. These unusually large activations, which we call super activations, are caused by a handful of scalar weights, the super weights. We observe that pruning these super weights significantly reduces the magnitude of the corresponding super activation. Our analysis indicates that, before the down projection, the element-wise product of the gate and up projections creates a relatively large activation, which the super weights then further amplify.
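
As a mental model, here is a minimal SwiGLU-style MLP block (using Llama-style module names gate_proj / up_proj / down_proj and illustrative dimensions, not the paper's exact configuration) showing where the large intermediate value appears and how a single entry of the down projection can amplify it:

```python
import torch
import torch.nn as nn

class SwiGLUMLP(nn.Module):
    """Illustrative SwiGLU-style MLP block (Llama naming); dimensions are examples."""

    def __init__(self, hidden=4096, intermediate=11008):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)
        self.act = nn.SiLU()

    def forward(self, x):
        # Element-wise product of the gate and up projections: this is where
        # the relatively large intermediate activation appears.
        h = self.act(self.gate_proj(x)) * self.up_proj(x)
        # A single entry of down_proj.weight (the super weight) amplifies that
        # large intermediate value into a super activation in the output.
        return self.down_proj(h)
```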

How to Locate Super Weights

  1. Obtain the output of the down projection, denoted as $Y$, and locate the activation outlier $Y_{ij}$.
  2. Identify the corresponding row $W_j$ of the down-projection matrix $W$: since $Y = XW^T$, we have $Y_{ij} = X_i \cdot W_j$.
  3. Remove the outliers in $W_j$ and check whether the magnitude of the activation outlier $Y_{ij}$ is significantly reduced (see the sketch below).
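
A minimal sketch of these three steps, assuming a module that exposes gate_proj / up_proj / down_proj (as in the earlier sketch) and a batch of hidden states x of shape (tokens, hidden). Picking the coordinate with the largest contribution |X_ik · W_jk| in step 2 is a simple heuristic, not necessarily the paper's exact criterion:

```python
import torch
import torch.nn.functional as F

def locate_super_weight(mlp, x):
    # Step 1: compute the down-projection output Y and find the outlier Y_ij.
    h = F.silu(mlp.gate_proj(x)) * mlp.up_proj(x)   # input X to down_proj
    y = mlp.down_proj(h)                            # Y = X W^T
    i, j = divmod(int(y.abs().argmax()), y.shape[1])

    # Step 2: row W_j of the down-projection matrix, since Y_ij = X_i . W_j.
    w_row = mlp.down_proj.weight[j]
    # Candidate super weight: the coordinate k with the largest contribution.
    k = int((h[i].abs() * w_row.abs()).argmax())

    # Step 3: zero the suspected super weight and check whether |Y_ij| collapses.
    with torch.no_grad():
        before = y[i, j].item()
        saved = mlp.down_proj.weight[j, k].item()
        mlp.down_proj.weight[j, k] = 0.0
        after = mlp.down_proj(h)[i, j].item()
        mlp.down_proj.weight[j, k] = saved          # restore the weight
    return (j, k), before, after
```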

Key Findings

Super Weights Impact Model Performance Beyond Just Super Activations

We found that simply removing the super weights, even while restoring the super activations, still leaves a noticeable performance gap compared to the original model on many benchmarks; restoring the super activations alone recovers much of the quality, but not all of it. This shows that super weights influence the model beyond the super activations they produce.
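
A rough sketch of this ablation, assuming a Hugging Face Llama-style layout (model.model.layers[i].mlp.down_proj) and that the super-weight coordinates and a typical super-activation value have already been found with the search step above. The hook-based restoration is an illustrative approximation, not necessarily the paper's exact procedure:

```python
import torch

def remove_sw_restore_sa(model, layer_idx, row, col, sa_value):
    """Zero one super weight, then pin the corresponding super-activation
    channel back to its typical value via a forward hook (hypothetical helper)."""
    mlp = model.model.layers[layer_idx].mlp
    with torch.no_grad():
        mlp.down_proj.weight[row, col] = 0.0        # remove the super weight

    def restore_hook(module, inputs, output):
        # Restore the super-activation channel (output dimension `row`).
        output[..., row] = sa_value
        return output

    # Returns a handle; call handle.remove() to undo the restoration hook.
    return mlp.down_proj.register_forward_hook(restore_hook)
```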

After removing super weights, the model becomes more likely to output stopwords such as “the”, “,”, and “.”.

In smaller models like Llama-7B, scaling the super weights can lead to a slight improvement in model performance.

Quantization Process

We use the following formulas for quantization and dequantization:

$$
\begin{aligned}
Q(\mathbf{X}) &= \text{Round}\left(\frac{\mathbf{X} - \min(\mathbf{X})}{\Delta}\right) \\
Q^{-1}(\mathbf{\hat{X}}) &= \Delta \cdot \mathbf{\hat{X}} + \min(\mathbf{X}) \\
\Delta &= \frac{\max(\mathbf{X}) - \min(\mathbf{X})}{2^{N-1} - 1}
\end{aligned}
$$
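
A minimal PyTorch sketch of this round-to-nearest scheme, following the formulas exactly as written above (including the $2^{N-1} - 1$ denominator):

```python
import torch

def quantize(x: torch.Tensor, n_bits: int):
    """Asymmetric min-max quantization Q(X) as defined above."""
    x_min, x_max = x.min(), x.max()
    delta = (x_max - x_min) / (2 ** (n_bits - 1) - 1)   # step size Δ
    q = torch.round((x - x_min) / delta)                 # Q(X)
    return q, delta, x_min

def dequantize(q: torch.Tensor, delta: torch.Tensor, x_min: torch.Tensor):
    """Dequantization Q^{-1}(X̂) = Δ · X̂ + min(X)."""
    return delta * q + x_min

# Example: 4-bit round trip on a random block of weights.
x = torch.randn(128, 128)
q, delta, x_min = quantize(x, n_bits=4)
x_hat = dequantize(q, delta, x_min)
print((x - x_hat).abs().max())   # reconstruction error bounded by ~Δ/2
```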

Activation Quantization

Results

Related Work