NExtLong: Toward Effective Long-Context Training without Long Documents

πŸ“… 2025-01-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
The scarcity of long-text training data hinders large language models' ability to model long-range dependencies. To address this, we propose NExtLong, a novel data synthesis framework based on Negative document Extension: a document is decomposed into meta-chunks, and hard negative distractors retrieved from pretraining corpora (passages that are semantically related yet logically incoherent with the original text) are interleaved between them, constructing high-difficulty synthetic long-context sequences. Training on these sequences compels the model to discriminate genuine long-range dependencies from distracting content. To our knowledge, NExtLong is the first systematic framework to exploit hard negatives for synthetic long-text generation. Experiments on the HELMET and RULER benchmarks demonstrate that NExtLong significantly outperforms existing synthetic-data approaches and leading models trained exclusively on authentic long documents, substantially reducing reliance on non-synthetic long-document corpora.

πŸ“ Abstract
Large language models (LLMs) with extended context windows have made significant strides, yet extending context remains challenging due to the scarcity of long documents. Existing methods synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
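The synthesis pipeline the abstract describes (decompose a document into meta-chunks, retrieve hard negatives, interleave) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it uses simple bag-of-words cosine similarity as a stand-in for the paper's retrieval step, and the function names (`synthesize`, `_vec`, `_cosine`) are hypothetical.

```python
import math
from collections import Counter


def _vec(text):
    # Bag-of-words term-count vector for a passage.
    return Counter(text.lower().split())


def _cosine(a, b):
    # Cosine similarity between two term-count vectors.
    num = sum(cnt * b[tok] for tok, cnt in a.items())
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def synthesize(meta_chunks, corpus, k=1):
    """Toy NExtLong-style synthesis: after each meta-chunk of the original
    document, interleave its k most similar corpus passages as hard
    negatives -- topically related, but not part of the document's own
    dependency chain -- yielding a longer, harder training sequence."""
    sequence = []
    for chunk in meta_chunks:
        sequence.append(chunk)
        ranked = sorted(corpus,
                        key=lambda p: _cosine(_vec(chunk), _vec(p)),
                        reverse=True)
        sequence.extend(ranked[:k])
    return sequence


# Usage: the cat-related corpus passage, being most similar, is interleaved
# as the hard negative; the unrelated finance passage is not selected.
doc_chunks = ["the cat sat on the mat", "then the cat slept"]
corpus = ["a cat chased a mouse", "stock markets fell today"]
print(synthesize(doc_chunks, corpus, k=1))
```

A real pipeline would use dense embeddings for retrieval and exclude passages from the source document itself; the point here is only the interleaving structure that forces the model to separate true long-range dependencies from distractors.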
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Long Text Understanding
Scarcity of Long-Document Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

NExtLong
Hard Negative Distractors
Enhanced Long-Text Processing
πŸ”Ž Similar Papers
No similar papers found.
Chaochen Gao
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Long-Context LLM
Xing Wu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Zijia Lin
Tsinghua University
Information Retrieval · Computer Vision · Natural Language Processing · Machine Learning
Debing Zhang
Xiaohongshu
Machine Learning · Computer Vision · Deep Learning
Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences