Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

📅 2025-10-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Long-context pretraining suffers from sparse long-range dependencies in training data and low training efficiency. This paper proposes LongFilter, the first framework to quantify the semantic gain from context expansion via information gain, enabling automatic selection of high-quality long-dependency samples without human annotation. Built on LLaMA-3-8B, it introduces a contrastive prediction mechanism that compares model predictions under short versus long context, coupled with dynamic information-gain estimation and adaptive data filtering. Evaluated on HELMET, LongBench, and RULER while scaling the context length from 8K to 64K tokens, LongFilter significantly improves performance across benchmarks and outperforms random-sampling baselines in both training efficiency and final model quality. Its core contributions are (1) an information-gain-driven paradigm for assessing long-range dependencies and (2) an efficient, scalable, annotation-free data-filtering framework.

๐Ÿ“ Abstract
Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
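The scoring idea in the abstract can be made concrete: compute per-token log-probabilities under a short (local) context and an extended (long) context, and keep samples where the extended context measurably improves prediction. Below is a minimal sketch of that selection rule, assuming the per-token log-probabilities have already been produced by the model under both settings; the function names, the sample dictionary layout, and the threshold value are illustrative, not the paper's actual API.

```python
def information_gain(logp_long, logp_short):
    """Average per-token gain in log-likelihood (nats/token) that the
    extended context provides over the local context."""
    assert len(logp_long) == len(logp_short) and logp_long
    return sum(l - s for l, s in zip(logp_long, logp_short)) / len(logp_long)

def filter_long_dependency(samples, threshold=0.1):
    """Keep samples whose information gain exceeds the threshold, i.e.
    samples where long-range context genuinely helps prediction."""
    return [s for s in samples
            if information_gain(s["logp_long"], s["logp_short"]) > threshold]

# Toy example: two documents scored under both context settings.
docs = [
    {"id": "a", "logp_long": [-1.2, -0.8], "logp_short": [-2.0, -1.6]},  # long context helps
    {"id": "b", "logp_long": [-1.5, -1.5], "logp_short": [-1.5, -1.5]},  # purely local
]
kept = filter_long_dependency(docs, threshold=0.1)  # retains only "a"
```

A near-zero gain means the span is predictable from local context alone, which is exactly the data the paper argues is inefficient to train on at long context lengths.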
Problem

Research questions and friction points this paper is trying to address.

Identifying meaningful long-range dependencies in pretraining data
Quantifying information gain from extended versus local context
Selecting high-quality data for efficient long-context model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

LongFilter framework curates long-context pretraining data
Measures information gain from extended versus short context
Identifies samples requiring long-range dependencies for training