Transformers on Edge Devices? Monash U’s Energy-Saving Attention With Linear Complexity Cuts Energy Costs by 73%

While transformer architectures have achieved remarkable success in recent years thanks to their impressive representational power, their quadratic complexity leads to excessively high power consumption, which hampers their deployment in many real-world applications, especially on low-power edge devices with limited resources.

A research team from Monash University addresses this issue in the new paper EcoFormer: Energy-Saving Attention With Linear Complexity, proposing EcoFormer, an attention mechanism with linear complexity that replaces costly multiply-accumulate operations with simple accumulations and achieves a 73% reduction in energy footprint on ImageNet.

The team summarizes its main contributions as follows:

  1. We propose a new binarization paradigm to better preserve pairwise similarity in softmax attention. In particular, we present EcoFormer, an energy-efficient attention with linear complexity, powered by kernelized hashing to map queries and keys into compact binary codes (see the sketch after this list).
  2. We learn kernelized hash functions based on ground truth Hamming affinity extracted from attention scores in a self-supervised way.
  3. Extensive experiments on CIFAR-100, ImageNet-1K and Long Range Arena show that EcoFormer is able to significantly reduce energy costs while maintaining accuracy.
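
To make the kernelized hashing in contributions 1 and 2 concrete, here is a minimal NumPy sketch of how queries and keys could be mapped to binary codes through an RBF kernel feature map followed by hash projections. It is an illustration only, not the authors' implementation: the anchor points, projection matrix and all sizes are toy assumptions, and the projections are random here rather than learned from the Hamming affinity targets described above.

```python
import numpy as np

def rbf_features(x, anchors, gamma=0.5):
    # Illustrative RBF kernel feature map: similarity of each token to a set of anchor points.
    sq_dists = ((x[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)   # (n, m)
    return np.exp(-gamma * sq_dists)

def kernel_hash(x, anchors, proj):
    # Map tokens to b-bit binary codes in {-1, +1} via RBF features + hash projections.
    # In the paper the hash functions are learned; the projections here are random for illustration.
    return np.sign(rbf_features(x, anchors) @ proj)                   # (n, b)

rng = np.random.default_rng(0)
n, d, m, b = 8, 16, 4, 8                       # toy sizes: tokens, dim, anchors, code bits
queries, keys = rng.normal(size=(n, d)), rng.normal(size=(n, d))
anchors = rng.normal(size=(m, d))              # hypothetical anchor points for the RBF kernel
proj = rng.normal(size=(m, b))                 # hypothetical hash hyperplanes

q_codes = kernel_hash(queries, anchors, proj)
k_codes = kernel_hash(keys, anchors, proj)

# Training signal (conceptually): token pairs with high attention scores should end up
# with small Hamming distance between their codes.
hamming = (b - q_codes @ k_codes.T) / 2        # (n, n) pairwise Hamming distances
print(hamming.shape)
```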

The basic idea of this work is to reduce the high cost of attention by applying binary quantization to the kernel embeddings, replacing energy-consuming multiplications with energy-efficient bitwise operations. The researchers note, however, that conventional binary quantization methods focus only on minimizing the quantization error between full-precision and binary values, which fails to preserve the pairwise semantic similarity between attention tokens and therefore hurts performance.
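
The energy argument rests on a simple observation: once a code is restricted to {-1, +1}, its dot product with a full-precision vector no longer needs multiplications, only additions and subtractions. A small illustrative sketch of that substitution, not taken from the paper:

```python
import numpy as np

def dot_via_accumulation(binary_code, values):
    # Dot product with a {-1, +1} code needs no multiplications: add or subtract each value.
    acc = 0.0
    for bit, v in zip(binary_code, values):
        acc = acc + v if bit > 0 else acc - v   # pure accumulation, no multiply-accumulate
    return acc

rng = np.random.default_rng(1)
code = np.sign(rng.normal(size=8))              # toy binary code in {-1, +1}
vals = rng.normal(size=8)                       # toy full-precision values

assert np.isclose(dot_via_accumulation(code, vals), code @ vals)
```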

To mitigate this problem, the team introduces a new binarization method that uses a kernelized hash with a Gaussian radial basis function (RBF) kernel to map the original high-dimensional query/key pairs to similarity-preserving low-dimensional binary codes. EcoFormer exploits this binarization to maintain the semantic similarity of attention while approximating self-attention in linear time at a lower energy cost.
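
The linear complexity follows from the usual kernel-attention factorization: with non-negative feature maps applied to queries and keys, the key-value products can be aggregated once and reused for every query, avoiding the n-by-n attention matrix. Below is a minimal NumPy sketch of that factorization under toy assumptions (binary codes shifted to {0, 1}); it illustrates the general recipe EcoFormer builds on rather than the paper's exact formulation.

```python
import numpy as np

def linear_attention(q_codes, k_codes, v, eps=1e-6):
    # Kernelized linear attention: associativity lets us form (K^T V) once,
    # so the cost grows linearly with the number of tokens.
    kv = k_codes.T @ v                     # (b, d_v): keys aggregated against values
    z = k_codes.sum(axis=0)                # (b,): normalizer term
    out = q_codes @ kv                     # (n, d_v)
    norm = q_codes @ z + eps               # (n,)
    return out / norm[:, None]

rng = np.random.default_rng(2)
n, b, d_v = 8, 8, 16                       # toy sizes: tokens, code bits, value dim
q_codes = (rng.normal(size=(n, b)) > 0).astype(float)   # stand-in binary query codes
k_codes = (rng.normal(size=(n, b)) > 0).astype(float)   # stand-in binary key codes
v = rng.normal(size=(n, d_v))

print(linear_attention(q_codes, k_codes, v).shape)      # (8, 16)
```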

In their empirical study, the team compared the proposed EcoFormer with standard multi-head self-attention (MSA) on ImageNet-1K. The results show that EcoFormer reduces energy consumption by 73% while incurring only a 0.33% performance drop.

Overall, the proposed EcoFormer energy-efficient attention mechanism with linear complexity represents a promising approach to easing the cost bottleneck that has limited the deployment of transformer models. In future work, the team plans to explore binarizing the value vectors in attention, the multilayer perceptrons and the nonlinearities to further reduce energy costs, and to extend EcoFormer to NLP tasks such as machine translation and speech analysis.

The code will be available on the project's GitHub. The paper EcoFormer: Energy-Saving Attention With Linear Complexity is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.