Rethinking Softmax: Self-Attention with Polynomial Activations | Cool Papers


Authors: Hemanth Saratchandran ; Jianqiao Zheng ; Yiping Ji ; Wenbo Zhang ; Simon Lucey

This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate that these activations perform comparably to or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.
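The Frobenius-norm claim can be made concrete with a small numerical sketch. This is an illustration, not the paper's code. Each row of a softmax attention matrix is a probability distribution, so for an n x n matrix the Frobenius norm always lies between 1 and sqrt(n), whatever the input scores are. The `scaled_cubic` function below is a hypothetical polynomial activation (a cube scaled by 1/sqrt(n)) chosen only for comparison; the abstract does not specify the exact form, scaling, or initialization the authors use. The random queries and keys, the sizes `n` and `d`, and all function names are likewise assumptions.

```python
import math
import random


def attention_scores(n, d, seed=0):
    """Return an n x n matrix of scaled dot products Q K^T / sqrt(d) for random Q, K."""
    rng = random.Random(seed)
    q = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    k = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = 1.0 / math.sqrt(d)
    return [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def softmax_rows(scores):
    """Row-wise softmax; each output row is a probability distribution."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scaled_cubic(scores):
    """Hypothetical polynomial activation (an assumption): x**3 scaled by 1/sqrt(n)."""
    n = len(scores)
    scale = 1.0 / math.sqrt(n)
    return [[scale * x ** 3 for x in row] for row in scores]


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


n, d = 64, 16  # sequence length and head dimension, chosen arbitrarily
scores = attention_scores(n, d)
softmax_norm = frobenius_norm(softmax_rows(scores))
cubic_norm = frobenius_norm(scaled_cubic(scores))

print(f"softmax Frobenius norm: {softmax_norm:.3f} (bounded by 1 and {math.sqrt(n):.3f})")
print(f"scaled cubic Frobenius norm: {cubic_norm:.3f} (no such input-independent bound)")
```

The softmax bound holds by construction, while the norm of the cubic variant depends on the scale of the scores. That is why a polynomial activation would need an explicit scaling factor to keep the norm under control.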

Subjects: Machine Learning ; Computer Vision and Pattern Recognition ; Machine Learning

Publish: 2024-10-24 10:08:25 UTC

arXiv: 2410.18613
