Transformer-based large language models are memory-hungry and incur significant inference latencies even on cutting-edge AI accelerators such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in the total context length, i.e., the prompt and output tokens. To address this, we propose LeanAttention, a scalable, hardware-efficient, “exact” attention acceleration mechanism for the decode phase of transformer-based models. LeanAttention scales the attention mechanism to the challenging case of long context lengths by redesigning the attention execution flow for the decode phase. As a result, we achieve an average 1.73x speedup in attention execution over FlashDecoding, and up to 2.18x speedup at a 256k context length.
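To make the decode-phase workload concrete, the sketch below computes attention for a single query token against a long KV cache, splitting the context into chunks whose partial softmax results are combined exactly via rescaling. This is the general style of execution-flow restructuring that FlashDecoding-like kernels build on; it is only an illustrative assumption here, not the paper's LeanAttention algorithm, and the function name and chunking scheme are made up for this example.

```python
import numpy as np

def chunked_decode_attention(q, K, V, chunk_size=1024):
    """Exact decode-phase attention for one query vector q (d,) against a
    KV cache K, V of shape (n, d), processed chunk by chunk.

    Each chunk contributes a partial softmax numerator/denominator; partials
    are merged with an online rescaling trick so the result matches a
    single-pass softmax(K q / sqrt(d)) @ V up to floating-point error.
    """
    d = q.shape[-1]
    n = K.shape[0]
    scale = 1.0 / np.sqrt(d)

    m = -np.inf          # running maximum of the attention scores
    s = 0.0              # running softmax denominator
    acc = np.zeros(d)    # running numerator (weighted sum of values)

    for start in range(0, n, chunk_size):
        Kc = K[start:start + chunk_size]          # (c, d)
        Vc = V[start:start + chunk_size]          # (c, d)
        scores = (Kc @ q) * scale                 # (c,)

        m_new = max(m, scores.max())
        # Rescale previously accumulated partials to the new running max.
        correction = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(scores - m_new)

        s = s * correction + p.sum()
        acc = acc * correction + p @ Vc
        m = m_new

    return acc / s
```

Checking this against a one-shot reference (softmax of K q / sqrt(d) applied to V) gives the same output up to rounding. Per decode step the work is linear in the cached context length, so over a full generation the total attention cost grows quadratically with context length, which is why long-context decoding is the regime worth restructuring.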