[Paper Note] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

https://arxiv.org/abs/2411.05007

https://github.com/mit-han-lab/nunchaku

TL;DR

  • SVDQuant is a quantization method in which both weights and activations are quantized to 4 bits.
  • Activation outliers are first migrated into the weights via smoothing. However, this makes the weight outliers more pronounced.
  • SVD is then used to decompose the weights into a sum of low-rank and residual components. The low-rank part is 16-bit, while the residual part is 4-bit. Because the residual has a smaller range, it is easier to quantize to 4-bit.
  • This method can be used with LoRA.

Motivation: Quantization reduces the resource consumption of diffusion models while aiming to preserve their output quality.

Unlike LLMs, diffusion models are compute-bound rather than memory-bound. Therefore, weight-only quantization cannot accelerate diffusion models. To achieve speedup, both weights and activations must be quantized to avoid upcasting.

Low-Rank Decomposition: This method uses low-rank decomposition to absorb outliers that cannot be accurately represented by 4-bit quantization.

LoRA Compatibility: This method is naturally compatible with LoRA because the LoRA parameters can be incorporated into the SVD part of the decomposition.
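As a small illustration of why this composition works, here is a sketch (the shapes, names, and the $W + AB$ LoRA layout are assumptions for illustration, not the paper's notation): the LoRA factors can simply be concatenated onto the low-rank factors $L_1, L_2$ introduced later in this note, so the 4-bit residual never needs to be re-quantized.

```python
import numpy as np

# Dimensions follow the X @ W layout: W is (m, n); L1 @ L2 is the SVD low-rank
# branch, and A @ B is a LoRA update on top of W. All values are placeholders.
m, n, r_svd, r_lora = 64, 32, 16, 4
rng = np.random.default_rng(0)
L1 = rng.normal(size=(m, r_svd)); L2 = rng.normal(size=(r_svd, n))
A  = rng.normal(size=(m, r_lora)); B = rng.normal(size=(r_lora, n))

# Fused low-rank branch: [L1 | A] @ [[L2], [B]] == L1 @ L2 + A @ B,
# so the adapter simply widens the 16-bit branch.
L1_fused = np.concatenate([L1, A], axis=1)   # (m, r_svd + r_lora)
L2_fused = np.concatenate([L2, B], axis=0)   # (r_svd + r_lora, n)

assert np.allclose(L1_fused @ L2_fused, L1 @ L2 + A @ B)
```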

New Engine: A new engine, Nunchaku, was developed to fuse the computation of the low-rank branch and the quantized branch. This engine also enables activation quantization.

Quantization Method

The simplest approach is to uniformly map a tensor (weights or activations) onto the 4-bit integer range.

The mapping is done using the following formula:

$$Q_X = \text{round}\!\left(\frac{X}{s_X}\right), \qquad s_X = \frac{\max(|X|)}{q_{\max}}.$$

Where:
  • $Q_X$ is the quantized (4-bit integer) tensor and $s_X$ is its scale factor;
  • $q_{\max}$ is the largest representable quantized value (e.g. 7 for signed 4-bit integers).

The matrix multiplication can be approximated as

$$XW \approx Q(X)\,Q(W) = s_X s_W \cdot Q_X Q_W,$$

where $Q(X) = s_X \cdot Q_X$ and $Q(W) = s_W \cdot Q_W$ denote the dequantized tensors.

For efficient GPU computation and to avoid upcasting, both operands of the matrix multiplication need to share the same bit width.
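Below is a minimal NumPy sketch of this scheme (an illustration, not the nunchaku implementation): per-tensor symmetric quantization with $q_{\max} = 7$ for signed INT4, followed by the approximated matrix multiplication in which both operands are low-bit integers and only the final rescale is floating point. Function names and shapes are assumptions.

```python
import numpy as np

Q_MAX = 7  # signed INT4 covers [-8, 7]; the scale maps max|X| onto 7

def quantize(x: np.ndarray):
    """Return the integer tensor Q_x and its per-tensor scale s_x."""
    s = np.abs(x).max() / Q_MAX
    q = np.clip(np.round(x / s), -Q_MAX - 1, Q_MAX).astype(np.int8)
    return q, s

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)   # activations
W = rng.normal(size=(64, 32)).astype(np.float32)  # weights

Q_X, s_X = quantize(X)
Q_W, s_W = quantize(W)

# Both operands are low-bit integers, so the inner product can run without
# upcasting; only the final rescale by s_X * s_W is floating point.
approx = s_X * s_W * (Q_X.astype(np.int32) @ Q_W.astype(np.int32))

print("relative error:", np.linalg.norm(X @ W - approx) / np.linalg.norm(X @ W))
```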

Absorbing Outliers with Low-Rank Decomposition

Smoothing (Xiao et al., 2023) can mitigate activation outliers to some extent via per-channel scaling by $\mathrm{diag}(\lambda)^{-1}$. However, the weights must be rescaled in turn, $\hat{W} = \mathrm{diag}(\lambda) \cdot W$, which amplifies the weight outliers and makes the weights harder to quantize.
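To make this migration concrete, here is a SmoothQuant-style sketch under the $XW$ convention used in this note; the migration strength $\alpha = 0.5$ and the random calibration data are assumptions for illustration, not values from the paper.

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Return X_hat = X @ diag(lambda)^-1 and W_hat = diag(lambda) @ W."""
    act_max = np.abs(X).max(axis=0)                    # per input channel, shape (c_in,)
    w_max = np.clip(np.abs(W).max(axis=1), 1e-8, None) # per input channel, shape (c_in,)
    lam = np.clip(act_max ** alpha / w_max ** (1 - alpha), 1e-8, None)
    X_hat = X / lam               # divide each activation channel by lambda
    W_hat = W * lam[:, None]      # scale the matching weight rows by lambda
    return X_hat, W_hat, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64)) * np.linspace(1, 50, 64)  # a few outlier-heavy channels
W = rng.normal(size=(64, 32))
X_hat, W_hat, lam = smooth(X, W)

# The product is unchanged, but the outliers have migrated from X to W.
assert np.allclose(X @ W, X_hat @ W_hat)
print("max|X| before/after:", np.abs(X).max(), np.abs(X_hat).max())
print("max|W| before/after:", np.abs(W).max(), np.abs(W_hat).max())
```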

To address this, we decompose the smoothed weights $\hat{W}$ into $\hat{W} = L_1 L_2 + R$, the sum of a low-rank component and a residual component. The rank can be set to 16 or 32, which is far smaller than the dimensions of $\hat{W}$.

The matrix multiplication can then be expressed as:

$$XW = \hat{X}\hat{W} = \hat{X}L_1L_2 + \hat{X}R \approx \underbrace{\hat{X}L_1L_2}_{\text{16-bit low-rank branch}} + \underbrace{Q(\hat{X})\,Q(R)}_{\text{4-bit residual}}.$$

Note that $\hat{X} = X \cdot \mathrm{diag}(\lambda)^{-1}$, indicating that the activation outliers have already been handled using the SmoothQuant method.

The proofs are omitted here. In summary, the quantization error $\|R - Q(R)\|$ is bounded by the magnitude of the residual, $\|R\|$.

To minimize $\|R\|$, i.e. to maximize $\|L_1 L_2\|$, SVD can be employed. The first 32 or so singular values of $\hat{W}$ are typically large, while the remaining ones are small and decay slowly; those smaller singular values therefore do not need to be captured by the low-rank branch.
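The sketch below illustrates this split with NumPy: take the top-$r$ singular directions of the smoothed weight as the 16-bit branch, quantize the residual to 4 bits (simulated here with fake quantization), and compare against naively quantizing the whole weight. The rank, the synthetic outlier-heavy weight, and the helper names are assumptions for illustration.

```python
import numpy as np

Q_MAX = 7

def fake_quant(x):
    """Simulated 4-bit quantize-dequantize with a per-tensor scale."""
    s = np.abs(x).max() / Q_MAX
    return np.clip(np.round(x / s), -Q_MAX - 1, Q_MAX) * s

def svd_split(W_hat, rank=32):
    """Top-rank factors of W_hat: L1 is (m, r), L2 is (r, n)."""
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

rng = np.random.default_rng(0)
X_hat = rng.normal(size=(8, 256))
# A weight with a strong low-rank "outlier" component, as smoothing tends to produce.
W_hat = rng.normal(size=(256, 128)) + np.outer(20 * rng.normal(size=256),
                                               rng.normal(size=128))

L1, L2 = svd_split(W_hat, rank=32)
R = W_hat - L1 @ L2                # residual with a much smaller value range

# 16-bit low-rank branch + 4-bit residual branch vs. naive 4-bit quantization.
svdquant_style = X_hat @ L1 @ L2 + fake_quant(X_hat) @ fake_quant(R)
naive_4bit = fake_quant(X_hat) @ fake_quant(W_hat)

ref = X_hat @ W_hat
rel_err = lambda A: np.linalg.norm(ref - A) / np.linalg.norm(ref)
print("naive 4-bit error:", rel_err(naive_4bit))
print("low-rank + 4-bit residual error:", rel_err(svdquant_style))
```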

In practice, we further reduce the quantization error by iteratively updating the low-rank branch: decompose $\hat{W} - Q(R)$ again, adjust $R$ accordingly, repeat for several iterations, and keep the result with the smallest error.
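One possible reading of this refinement loop, as a sketch (the helpers repeat the illustrative ones above, and the iteration count and stopping rule are assumptions):

```python
import numpy as np

Q_MAX = 7

def fake_quant(x):
    s = np.abs(x).max() / Q_MAX
    return np.clip(np.round(x / s), -Q_MAX - 1, Q_MAX) * s

def svd_split(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

def refine(W_hat, rank=32, iters=5):
    """Alternately re-fit the low-rank branch on W_hat - Q(R) and recompute R,
    keeping the split with the smallest reconstruction error."""
    L1, L2 = svd_split(W_hat, rank)
    best, best_err = None, np.inf
    for _ in range(iters):
        R = W_hat - L1 @ L2
        err = np.linalg.norm(W_hat - (L1 @ L2 + fake_quant(R)))
        if err < best_err:
            best, best_err = (L1, L2, R), err
        L1, L2 = svd_split(W_hat - fake_quant(R), rank)  # decompose W_hat - Q(R)
    return best

rng = np.random.default_rng(0)
W_hat = rng.normal(size=(256, 128))
L1, L2, R = refine(W_hat)
print("reconstruction error:", np.linalg.norm(W_hat - (L1 @ L2 + fake_quant(R))))
```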

New Engine: Nunchaku

Executing the low-rank branch and the quantized branch sequentially results in a runtime roughly 1.5 times that of the quantized branch alone. The overhead comes mainly from extra memory access rather than computation, since the computational cost of the low-rank branch is small.

By fusing the two branches into a single kernel, we can share their memory accesses, minimizing the overhead of the low-rank branch.

Results

Related Work