[Paper Note] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

https://arxiv.org/abs/2006.16236

Our Approach

The purpose of softmax is to identify the tokens most relevant to the current token.
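For reference, standard attention turns those relevance scores into a softmax-weighted average of the value vectors. A minimal NumPy sketch (shapes and names are mine, not from the paper):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: each output row is a softmax-weighted
    average of the value rows. Q, K: (N, d); V: (N, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (N, N) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # (N, d_v)
```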

Linear Attention

What can replace softmax?

Example of $\text{sim}(Q, K)$: $\text{sim}(Q, K) = Q^T K$

We can write a generalized attention equation for any similarity function as follows:

$$V'_i = \frac{\sum_{j=1}^N \text{sim}\left(Q_i, K_j\right) V_j}{\sum_{j=1}^N \text{sim}\left(Q_i, K_j\right)}$$
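As a sanity check, this generalized form can be computed directly with a loop over queries and keys (a sketch with my own naming; with $\text{sim}(q, k) = \exp(q^T k / \sqrt{d})$ it reproduces softmax attention):

```python
import numpy as np

def generalized_attention(Q, K, V, sim):
    """V'_i = sum_j sim(Q_i, K_j) V_j / sum_j sim(Q_i, K_j)."""
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        w = np.array([sim(Q[i], K[j]) for j in range(K.shape[0])])
        out[i] = w @ V / w.sum()
    return out

# Choosing the exponential of the scaled dot product recovers softmax attention.
softmax_sim = lambda q, k: np.exp(q @ k / np.sqrt(len(q)))
```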

Define a kernel function $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^C$ that maps to a $C$-dimensional feature space.

$$V'_i = \frac{\sum_{j=1}^N \phi\left(Q_i\right)^T \phi\left(K_j\right) V_j}{\sum_{j=1}^N \phi\left(Q_i\right)^T \phi\left(K_j\right)}$$
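In code, this is the same computation with $\text{sim}(q, k) = \phi(q)^T \phi(k)$, still evaluated the quadratic way, one weight per (query, key) pair. A sketch, where `phi` is any suitable feature map applied row-wise:

```python
def kernel_attention(Q, K, V, phi):
    """Attention with sim(q, k) = phi(q)^T phi(k), computed explicitly
    as an (N, N) weight matrix before normalizing."""
    Qf, Kf = phi(Q), phi(K)          # (N, C) feature-mapped queries and keys
    weights = Qf @ Kf.T              # weights[i, j] = phi(Q_i)^T phi(K_j)
    return (weights @ V) / weights.sum(axis=-1, keepdims=True)
```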

To simplify this expression, let’s temporarily ignore the $\phi$ function and directly use $Q_i^T K_j$.

$$V'_i = \frac{\sum_{j=1}^N Q_i^T K_j V_j}{\sum_{j=1}^N Q_i^T K_j}$$

Then, it can be rewritten as:

$$V'_i = \frac{Q_i^T \left(\sum_{j=1}^N K_j V_j^T\right)}{Q_i^T \left(\sum_{j=1}^N K_j\right)}$$

This is possible because $\sum_{j=1}^N c\, x_j y_j = c \sum_{j=1}^N x_j y_j$, and $Q_i$ is constant with respect to the summation over $j$, so it can be factored out.

By expressing the equation in this form, we avoid recomputing $\sum_{j=1}^N K_j V_j^T$ and $\sum_{j=1}^N K_j$ for every query position $i$: they can be computed once and reused across all queries.

Note that $Q_i$ cannot be canceled out from the numerator and denominator because $Q_i$ is a vector, not a scalar.
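A small sketch of the factored computation (my own code, non-causal case): the two sums over $j$ are computed once and shared across every query, and the result matches the quadratic form up to floating-point error.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Factored form: sum_j K_j V_j^T and sum_j K_j are computed once,
    then reused for every query Q_i."""
    kv = K.T @ V                     # (d, d_v)  = sum_j K_j V_j^T
    k_sum = K.sum(axis=0)            # (d,)      = sum_j K_j
    return (Q @ kv) / (Q @ k_sum)[:, None]

def quadratic_attention(Q, K, V):
    """Same quantity computed with the explicit (N, N) weight matrix."""
    w = Q @ K.T
    return (w @ V) / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K, V = rng.random((5, 4)), rng.random((5, 4)), rng.random((5, 3))
assert np.allclose(linear_attention(Q, K, V), quadratic_attention(Q, K, V))
```

The factored version scales linearly in the sequence length $N$, versus quadratically for the explicit weight matrix, which is where "linear attention" gets its name.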

We define $\phi(x) = \text{elu}(x) + 1$ as the kernel function.

ELU (Exponential Linear Unit) is defined as:

$$\text{ELU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha (e^x - 1) & \text{if } x < 0 \end{cases}$$

where $\alpha$ is a hyperparameter, typically set to 1.
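A sketch of this kernel function in NumPy (with $\alpha = 1$); the $+1$ keeps every feature strictly positive, so the attention weights $\phi(Q_i)^T \phi(K_j)$ and the normalizer stay positive:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x >= 0, alpha * (exp(x) - 1) for x < 0.
    The minimum keeps exp from overflowing on large positive inputs."""
    return np.where(x >= 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def phi(x):
    """Feature map phi(x) = elu(x) + 1, which is always greater than 0."""
    return elu(x) + 1.0
```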

An Interesting Question

Doesn’t linear attention still require reading all of the model’s weights to generate each next token? If the bottleneck is mainly memory bandwidth, then it doesn’t offer much advantage.

Related Work