DAPE V2: Process Attention Score as Feature Map for Length Extrapolation

Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li

The attention mechanism is a fundamental component of the Transformer model, mediating interactions among distinct tokens. In general, attention scores are determined simply by key-query products. However, an incidental experiment in this work (combining DAPE and NoPE), which applies additional MLPs to attention scores without position encoding, indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (to neighboring attention scores across different heads) to mimic the processing methods used in computer vision. Specifically, **the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query-key dot product, and translating the length extrapolation issue into a well-understood feature map processing problem**; the resulting method is called Convolutional Data-Adaptive Position Encoding (CDAPE). This insight, which can be adapted to various attention-related models, suggests that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
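
To make the feature-map idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the raw query-key logits are stored as a nested list of shape [heads][queries][keys], that heads play the role of image channels, and that a small kernel slides along the key axis over the current and preceding keys. The function names, the residual connection, the kernel layout, and the toy numbers are all illustrative assumptions.

```python
import math

def conv_over_scores(scores, weight):
    """Refine attention logits with a convolution that mixes heads.

    scores: [H][T][T] raw query-key logits (heads act as channels).
    weight: [H_out][H_in][K] hypothetical kernel; offset k reads key j - k.
    Returns logits of the same shape with the convolution output added
    as a residual. Only causal positions (key j <= query i) are filled.
    """
    num_heads, seq_len = len(scores), len(scores[0])
    kernel_size = len(weight[0][0])
    refined = [[[0.0] * seq_len for _ in range(seq_len)] for _ in range(num_heads)]
    for h_out in range(num_heads):
        for i in range(seq_len):
            for j in range(i + 1):
                acc = 0.0
                for h_in in range(num_heads):
                    for k in range(kernel_size):
                        if j - k >= 0:  # skip positions left of the first key
                            acc += weight[h_out][h_in][k] * scores[h_in][i][j - k]
                refined[h_out][i][j] = scores[h_out][i][j] + acc
    return refined

def causal_softmax(logits):
    """Row-wise softmax over the visible keys (j <= i); masked keys get 0."""
    probs = []
    for head in logits:
        head_probs = []
        for i, row in enumerate(head):
            visible = row[: i + 1]
            peak = max(visible)  # subtract the max for numerical stability
            exps = [math.exp(x - peak) for x in visible]
            total = sum(exps)
            head_probs.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
        probs.append(head_probs)
    return probs

# Toy example: 2 heads, 2 tokens. Head 0 additionally reads head 1 at offset 0.
scores = [[[0.0, 0.0], [1.0, 2.0]],
          [[0.0, 0.0], [3.0, 4.0]]]
weight = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
weight[0][1][0] = 1.0
refined = conv_over_scores(scores, weight)
probs = causal_softmax(refined)
```

In this sketch the kernel weights would be learned in practice; they are set by hand here only to show that each refined score can depend on neighboring scores from other heads before the softmax is applied.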

Subject: ACL.2025 - Long Papers

2025.acl-long.522@ACL
