LongAttn: Selecting Long-context Training Data via Token-level Attention

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing data-filtering methods for training long-context large language models operate at coarse (e.g., sentence-level) granularity, leading to inefficiency and imprecise identification of semantically critical long-range dependencies. Method: This paper introduces the first token-level framework for quantifying long-range dependencies, grounded in the self-attention mechanism. It jointly models two metrics, dependency strength and the uniformity of the attention-score distribution, to enable fine-grained, efficient assessment of long-distance semantic associations, departing from conventional sentence-level filtering. Based on this framework, the authors design a lightweight long-text filtering algorithm and construct LongABC-32K, a high-quality long-context dataset of 32K-token sequences. Contribution/Results: Fine-tuning models on LongABC-32K yields significant performance gains across multiple long-context benchmarks. The code and dataset are publicly released.

📝 Abstract
With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods to select long-context data often rely on sentence-level analysis, which can be greatly optimized in both performance and efficiency. In this paper, we propose a novel token-level framework, LongAttn, which leverages the self-attention mechanism of LLMs to measure the long-range dependencies for the data. By calculating token-level dependency strength and distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Through our comprehensive experiments, LongAttn has demonstrated its excellent effectiveness, scalability, and efficiency. To facilitate future research in long-context data, we released our code and the high-quality long-context training data LongABC-32K.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-context capabilities in LLMs
Optimizing long-context data selection efficiency
Quantifying token-level long-range dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level attention for data selection
Self-attention mechanism in LLMs
Quantifies long-range dependencies efficiently
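The two signals listed above can be sketched in a few lines. The function below is an illustrative proxy only, assuming access to a causal attention matrix from some LLM layer; the name `longattn_scores`, the `window` threshold, and both formulas (long-range attention mass for dependency strength, normalized entropy for distribution uniformity) are hypothetical stand-ins, not the paper's exact definitions.

```python
import numpy as np

def longattn_scores(attn, window=2048):
    """Illustrative token-level long-range dependency scoring.

    attn:   (seq_len, seq_len) causal attention matrix (each row i
            is a distribution over positions 0..i).
    window: tokens farther back than this count as "long-range".
    Returns (dependency strength, distribution uniformity), both
    averaged over the positions that can see beyond the window.
    """
    seq_len = attn.shape[0]
    strengths, uniformities = [], []
    for i in range(window, seq_len):
        row = attn[i, : i + 1]
        # Dependency strength: attention mass placed on distant tokens,
        # i.e. positions more than `window` steps behind token i.
        strengths.append(row[: i + 1 - window].sum())
        # Uniformity: entropy of the attention distribution, normalized
        # by the maximum entropy log(i + 1) so the score lies in [0, 1].
        p = row / row.sum()
        entropy = -(p * np.log(p + 1e-12)).sum()
        uniformities.append(entropy / np.log(i + 1))
    return float(np.mean(strengths)), float(np.mean(uniformities))
```

A sequence whose later tokens attend heavily (and evenly) to distant context would score high on both metrics and be kept by a filter built on this idea; text with only local attention mass would be discarded.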
Longyun Wu
Peking University
Dawei Zhu
Peking University
Guangxiang Zhao
Peking University
Zhuocheng Yu
Peking University
Junfeng Ran
Peking University
Xiangyu Wong
Peking University
Lin Sun
Qihoo 360
Sujian Li
Peking University