https://arxiv.org/abs/2411.05007
https://github.com/mit-han-lab/nunchaku
Motivation: Quantization is needed to maintain diffusion model performance while reducing resource consumption.
Unlike LLMs, diffusion models are compute-bound rather than memory-bound. Therefore, weight-only quantization cannot accelerate diffusion models. To achieve speedup, both weights and activations must be quantized to avoid upcasting.
Low-Rank Decomposition: This method uses low-rank decomposition to absorb outliers that cannot be accurately represented by 4-bit quantization.
LoRA Compatibility: This method is naturally compatible with LoRA because the LoRA parameters can be folded into the low-rank (SVD) branch of the decomposition, without re-quantizing the 4-bit weights.
New Engine: A new inference engine, Nunchaku, was developed to fuse the computation of the low-rank branch with the quantized branch, including the kernel that quantizes the activations, so the extra branch adds little memory traffic.
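On the LoRA-compatibility point above: since the method keeps a 16-bit low-rank product $L_1 L_2$ alongside a 4-bit residual (detailed below), a LoRA update can be folded in by concatenating factors. A minimal NumPy sketch with hypothetical names and shapes (not the official nunchaku API):

```python
import numpy as np

# Hypothetical helper (not the official nunchaku API): a LoRA update
# lora_scale * A @ B can be folded into the 16-bit low-rank branch
# W ≈ L1 @ L2 + R by concatenating factors, leaving the 4-bit residual R untouched.
def merge_lora_into_low_rank(L1, L2, lora_A, lora_B, lora_scale=1.0):
    """L1: (k, r), L2: (r, n); lora_A: (k, r_lora), lora_B: (r_lora, n)."""
    L1_new = np.concatenate([L1, lora_scale * lora_A], axis=1)  # (k, r + r_lora)
    L2_new = np.concatenate([L2, lora_B], axis=0)               # (r + r_lora, n)
    # L1_new @ L2_new == L1 @ L2 + lora_scale * lora_A @ lora_B
    return L1_new, L2_new
```

Because only the small 16-bit factors change, switching LoRAs does not touch the quantized weights.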
The simplest approach is to uniformly map the weights to a 4-bit range.
The mapping is done using the following formula:

$$Q(W) = \operatorname{round}\!\left(\frac{W}{s_W}\right), \qquad s_W = \frac{\max(|W|)}{q_{\max}}$$

Where: $s_W$ is the quantization scale and $q_{\max}$ is the largest representable value of the 4-bit range ($q_{\max} = 7$ for signed INT4); dequantization recovers $W \approx s_W \cdot Q(W)$. Activations $X$ are quantized in the same way with scale $s_X$.

The matrix multiplication can then be approximated as

$$X W \approx s_X\, s_W \cdot Q(X)\, Q(W)$$
For efficient GPU computation and to avoid upcasting, both operands need to be at the same bit width.
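A minimal NumPy sketch of this W4A4 idea, using a single per-tensor scale for clarity (real kernels use finer-grained scales, e.g. per-group or per-channel):

```python
import numpy as np

# Symmetric 4-bit quantization and the approximated matmul
# X @ W ≈ s_X * s_W * (Q(X) @ Q(W)), emulated with int32 accumulation.
Q_MAX = 7  # signed INT4 range is [-8, 7]; use 7 for a symmetric mapping

def quantize(t):
    scale = np.abs(t).max() / Q_MAX
    q = np.clip(np.round(t / scale), -Q_MAX - 1, Q_MAX).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64)).astype(np.float32)
W = rng.standard_normal((64, 32)).astype(np.float32)

qX, sX = quantize(X)
qW, sW = quantize(W)
# Integer matmul, then a single floating-point rescale.
approx = (qX.astype(np.int32) @ qW.astype(np.int32)) * (sX * sW)
print(np.abs(X @ W - approx).max())  # error of the plain W4A4 matmul
```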
Smoothing (Xiao et al., 2023) can mitigate activation outliers to some extent by migrating them into the weights with a per-channel scaling factor $\lambda$, giving the smoothed activation $\hat{X} = X \cdot \operatorname{diag}(\lambda)^{-1}$. However, this results in new weights equivalent to $\hat{W} = \operatorname{diag}(\lambda) \cdot W$, which can amplify weight outliers and makes the weights harder to quantize.
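For reference, a hedged sketch of the smoothing step; the $\alpha$ migration-strength recipe below follows common SmoothQuant practice and is an assumption, not necessarily this paper's exact choice:

```python
import numpy as np

# SmoothQuant-style smoothing: migrate activation outliers into the weights
# with a per-input-channel factor lambda, so that X @ W == X_hat @ W_hat.
def smooth(X, W, alpha=0.5, eps=1e-8):
    act_max = np.abs(X).max(axis=0)                # per input channel, from calibration data
    w_max = np.abs(W).max(axis=1)                  # per input channel of W, shape (k, n)
    lam = ((act_max + eps) ** alpha) / ((w_max + eps) ** (1 - alpha))
    X_hat = X / lam                                # activation outliers are flattened ...
    W_hat = W * lam[:, None]                       # ... but weight outliers may get amplified
    return X_hat, W_hat
```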
To address this, we decompose the smoothed weights into $\hat{W} = L_1 L_2 + R$, the sum of a low-rank component $L_1 L_2$ and a residual $R$. The rank $r$ can be set to 16 or 32, which is significantly smaller than the dimensions of $\hat{W}$.
The matrix multiplication can then be expressed as:

$$X W = \hat{X}\hat{W} = \hat{X} L_1 L_2 + \hat{X} R \approx \hat{X} L_1 L_2 + s_{\hat{X}}\, s_R \cdot Q(\hat{X})\, Q(R)$$

The low-rank branch $\hat{X} L_1 L_2$ stays in 16 bits and is cheap because $r$ is small, while the residual branch runs in 4 bits. Note that $\hat{X} = X \operatorname{diag}(\lambda)^{-1}$, indicating that the activation outliers have already been handled using the SmoothQuant method.
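Putting the pieces together, a rough sketch of the decomposition and the two-branch forward pass (reusing quantize() from the earlier sketch; per-tensor scales only, so this is illustrative rather than the actual kernel logic):

```python
import numpy as np

# SVDQuant-style split: keep the top-r singular components of the smoothed
# weight as a 16-bit low-rank branch, quantize only the residual to 4 bits.
def svd_split(W_hat, rank=32):
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]       # (k, r), 16-bit low-rank factor
    L2 = Vt[:rank, :]                 # (r, n)
    R = W_hat - L1 @ L2               # residual, the only part quantized to 4 bits
    return L1, L2, R

def forward(X_hat, L1, L2, qR, sR):
    qX, sX = quantize(X_hat)                                          # 4-bit activations
    low_rank = (X_hat @ L1) @ L2                                      # cheap: rank r is small
    residual = (qX.astype(np.int32) @ qR.astype(np.int32)) * (sX * sR)
    return low_rank + residual                                        # ≈ X_hat @ W_hat
```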
The proofs are omitted here; in summary, the quantization error is bounded by the magnitude of the residual $R$.
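One way to see this bound (a standard triangle-inequality argument in the notation above, with $\tilde{X} = s_{\hat{X}} Q(\hat{X})$ and $\tilde{R} = s_R Q(R)$ denoting the dequantized tensors; a sketch, not the paper's exact statement):

$$\big\|\hat{X}\hat{W} - (\hat{X} L_1 L_2 + \tilde{X}\tilde{R})\big\| = \big\|\hat{X} R - \tilde{X}\tilde{R}\big\| \le \big\|\hat{X} - \tilde{X}\big\|\,\|R\| + \big\|\tilde{X}\big\|\,\|R - \tilde{R}\big\|$$

Both terms shrink as $\|R\|$ shrinks, since the quantization error $\|R - \tilde{R}\|$ itself scales with the magnitude of $R$.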
To minimize $\|R\|$, i.e. maximize the portion of $\hat{W}$ absorbed by $L_1 L_2$, SVD can be employed: keeping the top-$r$ singular components gives the optimal rank-$r$ approximation (Eckart–Young). The first ~32 singular values of $\hat{W}$ are typically large, while the magnitudes decrease slowly afterwards and stay small, so these smaller singular values do not need to be included in the low-rank branch.
In practice, we further reduce the quantization error by iteratively updating the low-rank branch: for several iterations, re-decompose what the quantized residual fails to capture and adjust $L_1 L_2$ and $R$ accordingly, then pick the result with the smallest error.
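A hedged sketch of one plausible refinement loop of this kind; the exact update rule here is an assumption, not necessarily the paper's procedure (quantize() is the helper from the earlier sketch):

```python
import numpy as np

# Alternate between quantizing the residual and re-running SVD on whatever the
# quantized residual failed to capture, keeping the split with the lowest error.
def iterative_split(W_hat, rank=32, iters=5):
    best, best_err = None, np.inf
    low_rank_target = W_hat                       # first pass: plain SVD of W_hat
    for _ in range(iters):
        U, S, Vt = np.linalg.svd(low_rank_target, full_matrices=False)
        L1, L2 = U[:, :rank] * S[:rank], Vt[:rank, :]
        R = W_hat - L1 @ L2                       # residual w.r.t. the original W_hat
        qR, sR = quantize(R)
        R_dq = sR * qR.astype(np.float32)
        err = np.abs(W_hat - (L1 @ L2 + R_dq)).max()
        if err < best_err:
            best, best_err = (L1, L2, qR, sR), err
        low_rank_target = W_hat - R_dq            # re-decompose what quantization missed
    return best
```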
Executing the low-rank branch and the quantized branch sequentially results in a runtime roughly 1.5× that of the quantized branch alone. This is primarily due to memory-bandwidth limitations: the extra computation in the low-rank branch is small, but it incurs additional reads and writes of the activations.
By fusing the two branches into a single kernel, some of these memory accesses can be shared, minimizing the overhead of the low-rank branch.