https://arxiv.org/abs/2411.05007
https://github.com/mit-han-lab/nunchaku
Motivation: Quantization is needed to maintain diffusion model performance while reducing resource consumption.
Unlike LLMs, diffusion models are compute-bound rather than memory-bound. Therefore, weight-only quantization cannot accelerate diffusion models. To achieve speedup, both weights and activations must be quantized to avoid upcasting.
Low-Rank Decomposition: This method uses low-rank decomposition to absorb outliers that cannot be accurately represented by 4-bit quantization.
LoRA Compatibility: This method is naturally compatible with LoRA because the LoRA parameters can be folded into the low-rank (SVD) branch of the decomposition, without re-quantizing the 4-bit weights.
New Engine: A new inference engine, Nunchaku, was developed to fuse the computation of the low-rank branch with the quantized branch, including the kernel that quantizes the activations, so the extra branch adds little memory traffic.
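On the LoRA-compatibility point above: since the method keeps a 16-bit low-rank product $L_1 L_2$ alongside a 4-bit residual (detailed below), a LoRA update can be folded in by concatenating factors. A minimal NumPy sketch with hypothetical names and shapes (not the official nunchaku API):

```python
import numpy as np

# Hypothetical helper (not the official nunchaku API): a LoRA update
# lora_scale * A @ B can be folded into the 16-bit low-rank branch
# W ≈ L1 @ L2 + R by concatenating factors, leaving the 4-bit residual R untouched.
def merge_lora_into_low_rank(L1, L2, lora_A, lora_B, lora_scale=1.0):
    """L1: (k, r), L2: (r, n); lora_A: (k, r_lora), lora_B: (r_lora, n)."""
    L1_new = np.concatenate([L1, lora_scale * lora_A], axis=1)  # (k, r + r_lora)
    L2_new = np.concatenate([L2, lora_B], axis=0)               # (r + r_lora, n)
    # L1_new @ L2_new == L1 @ L2 + lora_scale * lora_A @ lora_B
    return L1_new, L2_new
```

Because only the small 16-bit factors change, switching LoRAs does not touch the quantized weights.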
The simplest approach is to uniformly map the weights to a 4-bit range.
The mapping is done using the following formula:

$$Q(W) = \operatorname{round}\!\left(\frac{W}{s_W}\right), \qquad s_W = \frac{\max(|W|)}{q_{\max}}$$

Where: $s_W$ is the quantization scale and $q_{\max}$ is the largest representable value of the 4-bit range ($q_{\max} = 7$ for signed INT4); dequantization recovers $W \approx s_W \cdot Q(W)$. Activations $X$ are quantized in the same way with scale $s_X$.

The matrix multiplication can then be approximated as

$$X W \approx s_X\, s_W \cdot Q(X)\, Q(W)$$
For efficient GPU computation and to avoid upcasting, both operands need to be at the same bit width.
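A minimal NumPy sketch of this W4A4 idea, using a single per-tensor scale for clarity (real kernels use finer-grained scales, e.g. per-group or per-channel):

```python
import numpy as np

# Symmetric 4-bit quantization and the approximated matmul
# X @ W ≈ s_X * s_W * (Q(X) @ Q(W)), emulated with int32 accumulation.
Q_MAX = 7  # signed INT4 range is [-8, 7]; use 7 for a symmetric mapping

def quantize(t):
    scale = np.abs(t).max() / Q_MAX
    q = np.clip(np.round(t / scale), -Q_MAX - 1, Q_MAX).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64)).astype(np.float32)
W = rng.standard_normal((64, 32)).astype(np.float32)

qX, sX = quantize(X)
qW, sW = quantize(W)
# Integer matmul, then a single floating-point rescale.
approx = (qX.astype(np.int32) @ qW.astype(np.int32)) * (sX * sW)
print(np.abs(X @ W - approx).max())  # error of the plain W4A4 matmul
```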
Smoothing (Xiao et al., 2023) can mitigate activation outliers to some extent by migrating them into the weights with a per-channel scaling factor $\lambda$, giving the smoothed activation $\hat{X} = X \cdot \operatorname{diag}(\lambda)^{-1}$. However, this results in new weights equivalent to $\hat{W} = \operatorname{diag}(\lambda) \cdot W$, which can amplify weight outliers and makes the weights harder to quantize.
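For reference, a hedged sketch of the smoothing step; the $\alpha$ migration-strength recipe below follows common SmoothQuant practice and is an assumption, not necessarily this paper's exact choice:

```python
import numpy as np

# SmoothQuant-style smoothing: migrate activation outliers into the weights
# with a per-input-channel factor lambda, so that X @ W == X_hat @ W_hat.
def smooth(X, W, alpha=0.5, eps=1e-8):
    act_max = np.abs(X).max(axis=0)                # per input channel, from calibration data
    w_max = np.abs(W).max(axis=1)                  # per input channel of W, shape (k, n)
    lam = ((act_max + eps) ** alpha) / ((w_max + eps) ** (1 - alpha))
    X_hat = X / lam                                # activation outliers are flattened ...
    W_hat = W * lam[:, None]                       # ... but weight outliers may get amplified
    return X_hat, W_hat
```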
To address this, we decompose the smoothed weights into $\hat{W} = L_1 L_2 + R$, the sum of a low-rank component $L_1 L_2$ and a residual $R$. The rank $r$ can be set to 16 or 32, which is significantly smaller than the dimensions of $\hat{W}$.
The matrix multiplication can then be expressed as:

$$X W = \hat{X}\hat{W} = \hat{X} L_1 L_2 + \hat{X} R \approx \hat{X} L_1 L_2 + s_{\hat{X}}\, s_R \cdot Q(\hat{X})\, Q(R)$$

The low-rank branch $\hat{X} L_1 L_2$ stays in 16 bits and is cheap because $r$ is small, while the residual branch runs in 4 bits. Note that $\hat{X} = X \operatorname{diag}(\lambda)^{-1}$, indicating that the activation outliers have already been handled using the SmoothQuant method.
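Putting the pieces together, a rough sketch of the decomposition and the two-branch forward pass (reusing quantize() from the earlier sketch; per-tensor scales only, so this is illustrative rather than the actual kernel logic):

```python
import numpy as np

# SVDQuant-style split: keep the top-r singular components of the smoothed
# weight as a 16-bit low-rank branch, quantize only the residual to 4 bits.
def svd_split(W_hat, rank=32):
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]       # (k, r), 16-bit low-rank factor
    L2 = Vt[:rank, :]                 # (r, n)
    R = W_hat - L1 @ L2               # residual, the only part quantized to 4 bits
    return L1, L2, R

def forward(X_hat, L1, L2, qR, sR):
    qX, sX = quantize(X_hat)                                          # 4-bit activations
    low_rank = (X_hat @ L1) @ L2                                      # cheap: rank r is small
    residual = (qX.astype(np.int32) @ qR.astype(np.int32)) * (sX * sR)
    return low_rank + residual                                        # ≈ X_hat @ W_hat
```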
The proofs are omitted here; in summary, the quantization error is bounded by the magnitude of the residual $R$.
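One way to see this bound (a standard triangle-inequality argument in the notation above, with $\tilde{X} = s_{\hat{X}} Q(\hat{X})$ and $\tilde{R} = s_R Q(R)$ denoting the dequantized tensors; a sketch, not the paper's exact statement):

$$\big\|\hat{X}\hat{W} - (\hat{X} L_1 L_2 + \tilde{X}\tilde{R})\big\| = \big\|\hat{X} R - \tilde{X}\tilde{R}\big\| \le \big\|\hat{X} - \tilde{X}\big\|\,\|R\| + \big\|\tilde{X}\big\|\,\|R - \tilde{R}\big\|$$

Both terms shrink as $\|R\|$ shrinks, since the quantization error $\|R - \tilde{R}\|$ itself scales with the magnitude of $R$.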
To minimize $\|R\|$, i.e. maximize the portion of $\hat{W}$ absorbed by $L_1 L_2$, SVD can be employed: keeping the top-$r$ singular components gives the optimal rank-$r$ approximation (Eckart–Young). The first ~32 singular values of $\hat{W}$ are typically large, while the magnitudes decrease slowly afterwards and stay small, so these smaller singular values do not need to be included in the low-rank branch.
In practice, we further reduce the quantization error by iteratively updating the low-rank branch: for several iterations, re-decompose what the quantized residual fails to capture and adjust $L_1 L_2$ and $R$ accordingly, then pick the result with the smallest error.
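A hedged sketch of one plausible refinement loop of this kind; the exact update rule here is an assumption, not necessarily the paper's procedure (quantize() is the helper from the earlier sketch):

```python
import numpy as np

# Alternate between quantizing the residual and re-running SVD on whatever the
# quantized residual failed to capture, keeping the split with the lowest error.
def iterative_split(W_hat, rank=32, iters=5):
    best, best_err = None, np.inf
    low_rank_target = W_hat                       # first pass: plain SVD of W_hat
    for _ in range(iters):
        U, S, Vt = np.linalg.svd(low_rank_target, full_matrices=False)
        L1, L2 = U[:, :rank] * S[:rank], Vt[:rank, :]
        R = W_hat - L1 @ L2                       # residual w.r.t. the original W_hat
        qR, sR = quantize(R)
        R_dq = sR * qR.astype(np.float32)
        err = np.abs(W_hat - (L1 @ L2 + R_dq)).max()
        if err < best_err:
            best, best_err = (L1, L2, qR, sR), err
        low_rank_target = W_hat - R_dq            # re-decompose what quantization missed
    return best
```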
Executing the low-rank branch and the quantized branch sequentially results in a runtime roughly 1.5× that of the quantized branch alone. This is primarily due to memory-bandwidth limitations: the extra computation in the low-rank branch is small, but it incurs additional reads and writes of the activations.
By fusing the two branches into a single kernel, some of these memory accesses can be shared, minimizing the overhead of the low-rank branch.