Does Self-Attention Need Separate Weights in Transformers?

Authors: Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Garibay, Niloofar Yousefi

Self-attention has revolutionized natural language processing by capturing long-range dependencies and improving context understanding. However, it comes with high computational costs and struggles with sequential data’s inherent directionality. This paper investigates and presents a simplified approach called “shared weight self-attention,” where a single weight matrix is used for Keys, Queries, and Values instead of separate matrices for each. This approach cuts training parameters by more than half and significantly reduces training time. Our method not only improves efficiency but also achieves strong performance on tasks from the GLUE benchmark, even outperforming the standard BERT baseline in handling noisy and out-of-domain data. Experimental results show a 66.53% reduction in parameter size within the attention block and competitive accuracy improvements of 3.55% and 0.89% over symmetric and pairwise attention-based BERT models, respectively.
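
As a rough illustration of the idea in the abstract, the sketch below applies one projection matrix to the input and reuses the result as queries, keys, and values. It is a minimal single-head sketch, not the authors' implementation: the helper names (`matmul`, `softmax`, `shared_weight_attention`, `attention_projection_params`), the toy sizes, and the hidden size of 768 used for the parameter count are all assumptions made here. Counting only the projection matrices gives a reduction of about 66.7%, close to but not identical to the 66.53% reported, so the paper's method presumably includes details the abstract does not describe. Reusing one projection for queries and keys also makes the score matrix symmetric, and the abstract lists symmetric attention as a separate baseline, which is another reason to read this as an illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def shared_weight_attention(x, w_shared):
    """Single-head attention with one projection reused as Q, K and V.

    x is (sequence length) x (model width); w_shared is (model width) x (head width).
    Assumption: this is the simplest reading of "a single weight matrix".
    """
    s = matmul(x, w_shared)                 # shared projection
    d_k = len(s[0])
    s_t = [list(col) for col in zip(*s)]    # transpose, used as the keys
    scores = matmul(s, s_t)                 # symmetric because Q == K here
    weights = [softmax([v / math.sqrt(d_k) for v in row]) for row in scores]
    return matmul(weights, s)               # shared projection also serves as V


def attention_projection_params(d_model, shared):
    """Count projection weights only (no biases, no output projection)."""
    return d_model * d_model * (1 if shared else 3)


random.seed(0)
d_model = 4  # toy width for the demo
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]
w = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]
out = shared_weight_attention(x, w)

# Hidden size 768 is an assumed, BERT-base-like value for illustration.
standard = attention_projection_params(768, shared=False)
shared = attention_projection_params(768, shared=True)
reduction = 1 - shared / standard
print(f"projection parameters: {standard} -> {shared} ({reduction:.2%} fewer)")
```

The output `out` has the same shape as the projected input (one row per token), and each row is a convex combination of the projected tokens because every row of attention weights sums to one.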

Subject: NAACL.2025 - Industry Track

2025.naacl-industry.44@ACL
