[Paper Note] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
https://arxiv.org/abs/2006.16236
-
Aims to address the quadratic (in sequence length) complexity of self-attention in Transformers.
-
Self-attention can be expressed as a linear dot-product of kernel feature maps, achieving O(N) complexity.
-
When a kernel with positive similarity scores is applied to the queries and keys, linear attention converges comparably to standard softmax attention.
-
Previous works (Blanc & Rendle, 2017; Rawat et al., 2019) have approximated softmax with a linear dot product of feature maps to speed up training through sampling.
Our Approach
The purpose of softmax is to identify the tokens most relevant to the current token.
Linear Attention
What can replace softmax?
- Polynomial attention: $\text{sim}(Q, K) = (Q^T K + c)^d$
- RBF attention: $\text{sim}(Q, K) = \exp\left(-\gamma \lVert Q - K \rVert^2\right)$
- The key is to ensure non-negative similarity scores.
Example of $\text{sim}(Q, K)$: $\text{sim}(Q, K) = Q^T K$
We can write a generalized attention equation for any similarity function as follows:
$$V_i' = \frac{\sum_{j=1}^{N} \text{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \text{sim}(Q_i, K_j)}$$
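In particular, substituting the exponentiated scaled dot product recovers standard softmax attention:
$$\text{sim}(Q_i, K_j) = \exp\!\left(\frac{Q_i^T K_j}{\sqrt{d}}\right) \;\;\Rightarrow\;\; V_i' = \frac{\sum_{j=1}^{N} \exp\!\left(\frac{Q_i^T K_j}{\sqrt{d}}\right) V_j}{\sum_{j=1}^{N} \exp\!\left(\frac{Q_i^T K_j}{\sqrt{d}}\right)},$$
i.e., a softmax over $j$ weighting the values.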
Define a kernel feature map $\phi: \mathbb{R}^d \to \mathbb{R}^C$ that maps to a $C$-dimensional feature space.
$$V_i' = \frac{\sum_{j=1}^{N} \phi(Q_i)^T \phi(K_j)\, V_j}{\sum_{j=1}^{N} \phi(Q_i)^T \phi(K_j)}$$
To simplify the expression, let's temporarily drop $\phi$ and use $Q_i^T K_j$ directly.
$$V_i' = \frac{\sum_{j=1}^{N} Q_i^T K_j\, V_j}{\sum_{j=1}^{N} Q_i^T K_j}$$
Then, it can be rewritten as:
$$V_i' = \frac{Q_i^T \left(\sum_{j=1}^{N} K_j V_j^T\right)}{Q_i^T \left(\sum_{j=1}^{N} K_j\right)}$$
This is possible because $\sum_{j=1}^{N} c\, x_j y_j = c \sum_{j=1}^{N} x_j y_j$; here $Q_i^T$ plays the role of the constant $c$, since it does not depend on $j$.
By expressing the equation in this form, we avoid repeatedly calculating $\sum_{j=1}^{N} K_j V_j^T$ and $\sum_{j=1}^{N} K_j$: both sums are computed once and reused for every query $Q_i$.
Note that $Q_i$ cannot be canceled between the numerator and denominator, because $Q_i$ is a vector, not a scalar.
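As a quick sanity check on this rewriting (not from the paper; a minimal NumPy sketch with illustrative shapes), the naive double sum and the factored form give the same result:

```python
import numpy as np

# Minimal sketch: N tokens, head dimension d (illustrative values).
rng = np.random.default_rng(0)
N, d = 6, 4
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Naive form: for each query i, sum (Q_i^T K_j) V_j over all j  ->  O(N^2 d).
naive_num = np.einsum('id,jd,je->ie', Q, K, V)   # sum_j (Q_i . K_j) V_j
naive_den = np.einsum('id,jd->i', Q, K)          # sum_j (Q_i . K_j)
naive_out = naive_num / naive_den[:, None]

# Factored form: compute the shared sums once, reuse them for every query  ->  O(N d^2).
KV = K.T @ V                  # (d, d) = sum_j K_j V_j^T
K_sum = K.sum(axis=0)         # (d,)   = sum_j K_j
fast_out = (Q @ KV) / (Q @ K_sum)[:, None]

assert np.allclose(naive_out, fast_out)
```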
We define the feature map as $\phi(x) = \mathrm{elu}(x) + 1$.
ELU (Exponential Linear Unit) is defined as:
$$\mathrm{ELU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha \left(e^x - 1\right) & \text{if } x < 0 \end{cases}$$
where $\alpha$ is a hyperparameter, typically set to 1.
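Putting the pieces back together with $\phi(x) = \mathrm{elu}(x) + 1$, here is a minimal single-head, non-causal NumPy sketch of linear attention (my own illustration; the small `eps` stabilizer is an assumption, not from the paper):

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU as defined above: x for x >= 0, alpha * (exp(x) - 1) for x < 0.
    return np.where(x >= 0, x, alpha * np.expm1(x))

def feature_map(x):
    # phi(x) = elu(x) + 1, which keeps every similarity score positive.
    return elu(x) + 1.0

def linear_attention(Q, K, V, eps=1e-6):
    # Q, K: (N, d) queries and keys; V: (N, d_v) values. Single head, non-causal.
    phi_Q = feature_map(Q)        # (N, d)
    phi_K = feature_map(K)        # (N, d)
    KV = phi_K.T @ V              # (d, d_v) = sum_j phi(K_j) V_j^T, computed once
    Z = phi_K.sum(axis=0)         # (d,)     = sum_j phi(K_j), computed once
    num = phi_Q @ KV              # (N, d_v)
    den = phi_Q @ Z + eps         # (N,)
    return num / den[:, None]     # cost is O(N d d_v) instead of O(N^2)
```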
An Interesting Question
Doesn't autoregressive generation still require reading the entire model's weights to produce each new token? If the bottleneck is mainly memory bandwidth, then linear attention by itself doesn't offer much of an advantage.
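For context, the paper's causal formulation can be evaluated like an RNN with a constant-size state: the attention-side memory and compute per generated token stay constant, though the model weights themselves still have to be read at every step, which is the concern raised above. A minimal sketch of one decoding step (variable names are mine, not the paper's):

```python
import numpy as np

def decode_step(phi_q, phi_k, v, S, z, eps=1e-6):
    # One autoregressive step of causal linear attention, viewed as an RNN.
    # phi_q, phi_k: (d,) feature-mapped query/key of the current token; v: (d_v,) value.
    # S: (d, d_v) running sum of phi(k_j) v_j^T; z: (d,) running sum of phi(k_j).
    # The state (S, z) has a fixed size, independent of how many tokens came before.
    S = S + np.outer(phi_k, v)             # S_i = S_{i-1} + phi(k_i) v_i^T
    z = z + phi_k                          # z_i = z_{i-1} + phi(k_i)
    out = (phi_q @ S) / (phi_q @ z + eps)  # attention output for the current token
    return out, S, z
```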