Performer (2020) proposed a workaround to the fact that, because softmax is non-linear, we have to calculate the full attention matrix rather than decomposing it into independent parts. It bypasses this bottleneck with a method called FAVOR+ (Fast Attention Via Positive Orthogonal Random Features). Instead of calculating the exact softmax attention, it approximates it with a kernel trick: the Query and Key vectors are mapped by a random feature map $\phi$ (built from positive orthogonal random features) into a space where the dot product $\phi(q)^\top \phi(k)$ approximates the softmax kernel $\exp(q^\top k)$.
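As a rough illustration of the idea (a simplified sketch, not the exact FAVOR+ construction; the function name `random_features` and the variable names are mine), a positive random feature map for the softmax kernel can be written in a few lines of NumPy:

```python
import numpy as np

def random_features(x, proj):
    """Map rows of x (n, d) to positive random features (n, m) so that
    E[phi(q) @ phi(k)] ~= exp(q @ k), the unnormalized softmax kernel."""
    m = proj.shape[0]
    sq_norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    # exp(w.x - |x|^2 / 2) for m random Gaussian directions w, scaled by 1/sqrt(m)
    return np.exp(x @ proj.T - sq_norm) / np.sqrt(m)

d, m = 64, 256                  # head dimension, number of random features
proj = np.random.randn(m, d)    # w_i ~ N(0, I_d); FAVOR+ additionally orthogonalizes these
```

In the paper the random directions are drawn orthogonal to each other, which reduces the variance of the estimate; the plain Gaussian draw above keeps the sketch short.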
In standard attention, when calculating $\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$, we compute the $QK^\top$ part first because that is the input to the softmax function. It is this term that produces the huge $n \times n$ matrix, where $n$ is the sequence length. By approximating the softmax as a dot product of feature maps, the equation changes to roughly $\phi(Q)\,\phi(K)^\top V$ (up to a row-wise normalization). Because matrix multiplication is associative, the Performer can compute this in a different order:

$$\phi(Q)\,\big(\phi(K)^\top V\big) \quad\text{instead of}\quad \big(\phi(Q)\,\phi(K)^\top\big)\,V$$

This way $\phi(K)^\top V$ is calculated first. Since $\phi(K)$ is $n \times r$ and $V$ is $n \times d$, the result is a small $r \times d$ matrix, and the massive $n \times n$ attention matrix never has to be formed.
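A minimal sketch of the two orderings, reusing the hypothetical `random_features` helper from above (the $1/\sqrt{d}$ scaling is omitted here; in practice it can be folded into Q and K before applying $\phi$):

```python
def full_softmax_attention(Q, K, V):
    """Quadratic baseline: materializes the (n, n) attention matrix."""
    scores = np.exp(Q @ K.T)                        # (n, n) -- the expensive part
    return (scores / scores.sum(axis=-1, keepdims=True)) @ V

def performer_style_attention(Q, K, V, proj):
    """Linear-in-n ordering: phi(K).T @ V is computed first."""
    q_prime = random_features(Q, proj)              # (n, m)
    k_prime = random_features(K, proj)              # (n, m)
    kv = k_prime.T @ V                              # (m, d) -- small; no (n, n) matrix
    normalizer = q_prime @ k_prime.sum(axis=0)      # (n,) approximates the softmax row sums
    return (q_prime @ kv) / normalizer[:, None]

n = 1024
Q, K, V = (np.random.randn(n, d) / d ** 0.5 for _ in range(3))  # small-magnitude test data
print(np.max(np.abs(full_softmax_attention(Q, K, V) -
                    performer_style_attention(Q, K, V, proj))))
# the gap shrinks as the number of random features m grows
```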
This reordering allows the Performer to achieve time and space complexity that is linear in the sequence length. The paper shows that this method provably approximates regular softmax attention with bounded error.
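To make the linear claim concrete, a back-of-the-envelope operation count for the two orderings (with $n$ the sequence length, $d$ the head dimension, and $r$ the number of random features):

$$
\underbrace{\big(\phi(Q)\,\phi(K)^\top\big)\,V}_{O(n^2 r + n^2 d)\ \text{time},\ \ O(n^2)\ \text{extra space}}
\qquad\text{vs.}\qquad
\underbrace{\phi(Q)\,\big(\phi(K)^\top V\big)}_{O(n r d)\ \text{time},\ \ O(nr + rd)\ \text{extra space}}
$$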
Tags: AI