Linformer


Linformer (2020) is built on the observation that the self-attention matrix is "low-rank" and so contains a lot of redundancy. The authors introduce two learned projection matrices $E$ and $F$. These matrices project the Key and Value inputs along the sequence-length dimension, rather than the feature dimension, reducing them from the usual $n \times d$ to $k \times d$, where $k$ is a much smaller fixed constant. The Query matrix $Q$ keeps its original size $n \times d$ so that the output sequence retains the correct length. The attention calculation becomes:

$$\text{Attention} = \text{softmax}\left(\frac{Q \cdot (K_\text{projected})^{T}}{\sqrt{d_k}}\right) \cdot V_\text{projected}$$
Linformer attention
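
To make the projection step concrete, here is a minimal single-head sketch in PyTorch. The module name, initialization, and the single-head simplification are illustrative assumptions; only the shapes and the $E$/$F$ sequence-length projections follow the description above.

```python
# A minimal single-head sketch, assuming PyTorch. Module name, initialization,
# and the single-head simplification are illustrative; only the shapes and the
# E/F sequence-length projections follow the description above.
import math
import torch
import torch.nn as nn


class LinformerSelfAttention(nn.Module):
    def __init__(self, d_model: int, seq_len: int, k: int):
        super().__init__()
        self.d_model = d_model
        # Standard Q, K, V projections along the feature dimension.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # E and F are k x n: they compress the sequence-length dimension n -> k.
        self.E = nn.Parameter(torch.randn(k, seq_len) / math.sqrt(seq_len))
        self.F = nn.Parameter(torch.randn(k, seq_len) / math.sqrt(seq_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model), where n must equal seq_len
        q = self.w_q(x)                                  # (batch, n, d)
        k_proj = self.E @ self.w_k(x)                    # (batch, k, d)
        v_proj = self.F @ self.w_v(x)                    # (batch, k, d)
        # Scores are (batch, n, k) instead of (batch, n, n).
        scores = q @ k_proj.transpose(-2, -1) / math.sqrt(self.d_model)
        attn = torch.softmax(scores, dim=-1)
        return attn @ v_proj                             # (batch, n, d)
```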

Because $k$ is a fixed constant much smaller than $n$, the complexity drops from $O(n^2)$ to $O(nk)$. Since $k$ does not grow with the input length, this is effectively linear complexity, $O(n)$.
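
As a quick sanity check on that claim, the score matrix in the sketch above is $n \times k$ rather than $n \times n$ (the sizes below are arbitrary):

```python
# Quick shape check (sizes below are arbitrary): the output keeps length n,
# while the internal score matrix was n x k = 4096 x 256, not n x n.
layer = LinformerSelfAttention(d_model=64, seq_len=4096, k=256)
x = torch.randn(2, 4096, 64)
print(layer(x).shape)  # torch.Size([2, 4096, 64])
```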

Tags: AI