Google & UC Berkeley ‘Reformer’ Runs 64K Sequences on One GPU

Transformer models are an increasingly popular neural network architecture in the natural language processing (NLP) research field, where large transformers achieve state-of-the-art performance on many tasks. The tradeoff is transformers’ excessive compute consumption and cost, especially when training models on long sequences.

A recent paper from Google and UC Berkeley researchers, accepted by the prestigious International Conference on Learning Representations (ICLR 2020), proposes a new transformer model called “Reformer,” which achieves impressive performance even when running on only a single GPU.

To improve transformer efficiency, the researchers replaced dot-product attention with attention based on locality-sensitive hashing (LSH), changing the complexity from O(L²) to O(L log L), where L is the length of the sequence. LSH is an algorithmic technique used for nearest-neighbor search when mining similar items from massive data.
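The intuition is that a query only needs to attend to the keys closest to it, and hashing lets the model find those neighbors without computing all L² dot products. The sketch below is a minimal, single-round illustration of the idea in NumPy: positions are hashed into buckets with random-projection (angular) LSH over a shared query/key space, and attention is computed only within each bucket. The function names (`lsh_bucket`, `lsh_attention`) and parameters such as `n_buckets` are illustrative assumptions; the paper’s actual implementation additionally sorts and chunks buckets and uses multiple hash rounds.

```python
import numpy as np

def lsh_bucket(vectors, n_buckets, rng):
    """Angular LSH: project onto random directions; the index of the
    largest (signed) projection is the bucket id (one hash round)."""
    d = vectors.shape[-1]
    projections = rng.normal(size=(d, n_buckets // 2))
    rotated = vectors @ projections                      # (L, n_buckets//2)
    rotated = np.concatenate([rotated, -rotated], axis=-1)
    return np.argmax(rotated, axis=-1)                   # bucket id per position

def lsh_attention(qk, v, n_buckets=16, seed=0):
    """Toy single-round LSH attention with a shared query/key space:
    each position attends only to other positions in its hash bucket."""
    rng = np.random.default_rng(seed)
    L, d = qk.shape
    buckets = lsh_bucket(qk, n_buckets, rng)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]                  # positions in this bucket
        scores = qk[idx] @ qk[idx].T / np.sqrt(d)        # attention only within the bucket
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# Usage: a length-1024 sequence with 64-dimensional shared query/key vectors.
rng = np.random.default_rng(1)
qk = rng.normal(size=(1024, 64))
v = rng.normal(size=(1024, 64))
print(lsh_attention(qk, v).shape)   # (1024, 64)
```

Because nearby vectors land in the same bucket with high probability, each position compares itself against only a small neighborhood rather than the full sequence, which is where the O(L log L) behavior comes from.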

The researchers also used reversible residual layers instead of standard residuals, which means activations need to be stored only once during training instead of N times (where N is the number of layers). The final Reformer model performed on par with the Transformer model while being far more memory-efficient and much faster on long sequences.
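Below is a minimal sketch of the reversible residual idea (the RevNet-style coupling Reformer builds on): the layer’s inputs can be exactly recomputed from its outputs during the backward pass, so intermediate activations do not have to be cached per layer. The placeholder sub-layers `F` and `G` stand in for the attention and feed-forward blocks and are illustrative assumptions, not the paper’s actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_f = rng.normal(size=(d, d)) * 0.1   # stand-in weights for the "attention" sub-layer
W_g = rng.normal(size=(d, d)) * 0.1   # stand-in weights for the "feed-forward" sub-layer

def F(x):                              # placeholder attention sub-layer
    return np.tanh(x @ W_f)

def G(x):                              # placeholder feed-forward sub-layer
    return np.tanh(x @ W_g)

def rev_forward(x1, x2):
    """Reversible residual coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recover the inputs from the outputs, so activations need not be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.normal(size=(4, d)), rng.normal(size=(4, d))
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True
```

Since every layer can reconstruct its own inputs on the fly, training memory no longer grows with the number of layers, only with a single layer’s activations.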

The researchers compared the conventional Transformer with the proposed reversible Transformer on the image generation task imagenet64 (sequences of length 12K) and the text task enwik8 (sequences of length 64K). Both models had the same number of parameters, and their learning curves were nearly identical. The results showed that the reversible Transformer saves memory without sacrificing accuracy.

Figure: Effect of shared query-key space (left) and reversibility (right) on enwik8 and imagenet64 training performance. The curves show bits per dim on held-out data.

LSH attention is an approximation of full attention, and its…
