JbJVWljk7r@OpenReview

#1 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Authors: Jintao Zhang, Jia Wei, Haoxu Wang, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Kai Jiang, Jun Zhu, Jianfei Chen

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions. First, we leverage the new $\texttt{FP4}$ Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves $\textbf{1038}$ $\texttt{TOPS}$ on the $\texttt{RTX5090}$, a $\textbf{5}\times$ speedup over the fastest FlashAttention on the $\texttt{RTX5090}$. Experiments show that our $\texttt{FP4}$ attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works such as FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient $\texttt{8-bit}$ attention for both forward and backward propagation. Experiments indicate that $\texttt{8-bit}$ attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
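
For intuition about what "microscaling FP4" means here, the sketch below simulates the idea in PyTorch: small blocks of elements share a single power-of-two scale, and each element is rounded to a 4-bit E2M1 value before the $QK^\top$ matmul. The block size of 16, the nearest-value rounding rule, and the choice to fake-quantize only $Q$ and $K$ are illustrative assumptions; this is not the paper's Blackwell kernel, which is in the linked repository.

```python
# Minimal sketch (not the authors' kernel) of microscaling FP4 "fake quantization",
# simulated in PyTorch. Assumptions: blocks of 16 elements along the last dim share
# one power-of-two scale, elements are E2M1 values, and only Q and K are quantized.
import torch

# Representable non-negative magnitudes of the FP4 E2M1 format.
FP4_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Fake-quantize the last dim of x in blocks sharing one power-of-two scale."""
    orig_shape = x.shape
    x = x.reshape(-1, block)
    # Per-block scale so the largest magnitude maps near the FP4 max (6.0).
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.floor(torch.log2(amax / 6.0)))
    scaled = (x / scale).clamp(-6.0, 6.0)
    # Round each scaled element to the nearest representable E2M1 magnitude.
    grid = FP4_E2M1_VALUES.to(x.device, x.dtype)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * scaled.sign()
    # Return the dequantized values so the rest of the math stays in high precision.
    return (q * scale).reshape(orig_shape)

def mxfp4_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention with fake-quantized Q/K; softmax and PV stay in high precision."""
    qq, kq = quantize_mxfp4(q), quantize_mxfp4(k)
    scores = qq @ kq.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Example: batch 2, 8 heads, sequence length 128, head dim 64.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)
out = mxfp4_attention(q, k, v)
```

In the real kernel the attention matmuls run natively on FP4 Tensor Cores with hardware block scales; a simulation like this only mimics the rounding error, so it is useful for reasoning about accuracy, not speed.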

Subject: NeurIPS.2025 - Spotlight