LongAttn: Selecting Long-context Training Data via Token-level Attention

Authors: Longyun Wu, Dawei Zhu, Guangxiang Zhao, Zhuocheng Yu, Junfeng Ran, Xiangyu Wong, Lin Sun, Sujian Li

As large language models (LLMs) develop, the need to handle long contexts well has grown. To enhance long-context capabilities, constructing high-quality training data with **long-range dependencies** is crucial. Existing methods for selecting long-context data often rely on sentence-level analysis, which leaves considerable room for improvement in both performance and efficiency. In this paper, we propose a novel token-level framework, **LongAttn**, which leverages the self-attention mechanism of LLMs to measure the **long-range dependencies** in the data. By calculating the token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies **long-range dependencies**, enabling more accurate and efficient data selection. Using LongAttn, we filter **LongABC-32K** from open-source long-context datasets (ArXiv, Book, and Code). Our comprehensive experiments demonstrate LongAttn's **effectiveness**, **scalability**, and **efficiency**. We will release our code and the high-quality long-context dataset **LongABC-32K** in the future.
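
The abstract names two token-level quantities, dependency strength and distribution uniformity, without giving their formulas. The following is a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes a causal attention matrix `attn` whose rows sum to 1, a hypothetical distance threshold `min_distance`, strength taken as the mean attention mass that each token places on distant tokens, and uniformity taken as the negative variance of those per-token scores. All function names and both definitions are assumptions made for illustration.

```python
from statistics import mean, pvariance


def long_range_scores(attn, min_distance):
    """Per-token attention mass placed on keys at least `min_distance` positions back.

    attn[i][j] is the attention weight from query token i to key token j
    (assumed causal, so attn[i][j] == 0 for j > i, with each row summing to 1).
    """
    scores = []
    for i, row in enumerate(attn):
        if i < min_distance:
            continue  # no key lies far enough behind this query token
        # keys j with i - j >= min_distance are exactly j <= i - min_distance
        scores.append(sum(row[: i - min_distance + 1]))
    return scores


def summarize(attn, min_distance):
    """Return (strength, uniformity) under the assumed definitions above."""
    scores = long_range_scores(attn, min_distance)
    if not scores:
        raise ValueError("sequence is shorter than min_distance")
    strength = mean(scores)          # average long-range attention mass
    uniformity = -pvariance(scores)  # closer to 0 means more evenly spread
    return strength, uniformity


# Toy 4-token example: each token attends uniformly to itself and all earlier tokens.
uniform_attn = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(4)] for i in range(4)]
strength, uniformity = summarize(uniform_attn, min_distance=2)
```

Under these assumptions, a document whose tokens attend only locally would get a strength near zero, while a document with strong but unevenly distributed long-range attention would get a high strength and a strongly negative uniformity. A selection step could then rank documents by some combination of the two values.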

Subject: ACL.2025 - Findings

2025.findings-acl.991@ACL
