Reformer (2020) was able to reduce the number of attention operations from O(L²) to O(L log L), where L is the sequence length, by using a technique called locality-sensitive hashing (LSH). A hash function is applied in which a "sphere" is rotated around the vectors. If two vectors point in similar directions, they will usually land in the same section of the sphere and end up with similar hashes.
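A minimal sketch of this kind of angular hashing in NumPy (the function name, the fixed seed, and the single hash round are illustrative; the real model runs several hash rounds in parallel so a near-miss in one round can still be caught in another):

```python
import numpy as np

def lsh_hash(vectors, n_buckets, seed=0):
    """Angular LSH: rotate the vectors randomly and bucket them by direction.

    vectors: (n, d) array of query/key vectors.
    Returns one integer bucket id per vector; vectors pointing in similar
    directions tend to receive the same id.
    """
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    # Project onto n_buckets // 2 random directions (a random rotation).
    random_rotations = rng.normal(size=(d, n_buckets // 2))
    rotated = vectors @ random_rotations                    # (n, n_buckets // 2)
    # Append the negations so each direction carves the sphere into two buckets.
    rotated = np.concatenate([rotated, -rotated], axis=-1)  # (n, n_buckets)
    return np.argmax(rotated, axis=-1)

# Two nearly parallel vectors will usually share a bucket; an opposite one won't.
v = np.array([[1.0, 0.1], [0.95, 0.12], [-1.0, -0.1]])
print(lsh_hash(v, n_buckets=8))
```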


These hashes are then sorted so similar hashes are close to each other before being divided up into chunks. The model then only calculates attention within a chunk and its immediate neighbours. The idea is that, because similar words are grouped together by the hash, the model doesn't miss important connections, but it avoids calculating the relationships between unrelated words.
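Taking the bucket ids from above, the sort-and-chunk step might look something like this (a toy sketch: chunk_size and the dictionary output are made up for illustration, and the actual model does this with batched tensor operations, causal masking, and multiple hash rounds):

```python
import numpy as np

def chunked_attention_targets(bucket_ids, chunk_size):
    """For each position, list the positions it is allowed to attend to.

    Positions are sorted by bucket id so similar vectors sit next to each
    other, split into fixed-size chunks, and each chunk attends to itself
    and its neighbouring chunks.
    """
    order = np.argsort(bucket_ids, kind="stable")  # group similar hashes together
    chunks = [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]
    targets = {}
    for i, chunk in enumerate(chunks):
        # Current chunk plus its immediate neighbours (clipped at the ends).
        allowed = np.concatenate(chunks[max(i - 1, 0): i + 2])
        for pos in chunk:
            targets[int(pos)] = allowed
    return targets
```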
Reformer had a few other tricks to reduce its memory footprint. It used reversible layers, where each layer's input vector is split in half and a function of one half is added to the other. Running these transformations in reverse allows the inputs to be derived from the outputs, so the intermediate values don't need to be held in memory. This has turned into a relative dead end for numerical stability reasons, with more recent models relying on gradient checkpointing instead. The model's authors were still able to experiment with up to 64K tokens, though.
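As a rough illustration of the reversible idea, here is a toy NumPy version: the input is split into two halves, f and g stand in for the attention and feed-forward sub-layers, and the inverse function recomputes the inputs instead of storing them (the real model does this recomputation inside the backward pass):

```python
import numpy as np

def reversible_forward(x1, x2, f, g):
    """One reversible block: the outputs alone determine the inputs."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def reversible_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs, so they need not be kept in memory."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Round-trip check with toy stand-ins for the two sub-layers.
f = lambda x: np.tanh(x)
g = lambda x: 0.5 * x
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=4), rng.normal(size=4)
y1, y2 = reversible_forward(x1, x2, f, g)
assert np.allclose(reversible_inverse(y1, y2, f, g), (x1, x2))
```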
Tags: AI