🤖 AI Summary
To address the high cost of synthesizing high-quality data for long-context LLM training and the poor computational efficiency of existing relevance-aggregation methods, this paper proposes an efficient long-context data synthesis framework. Methodologically, it builds a hierarchical topic architecture grounded in the BISAC book classification system and employs a multi-LLM agent debate mechanism to generate semantically coherent, viewpoint-diverse topics within that structure; lightweight BM25 retrieval coupled with 128K-token document stitching then enables low-cost context scaling. The key contribution is the tight integration of structured knowledge organization with collaborative topic generation, balancing diversity, consistency, and scalability. The framework achieves competitive performance on the HELMET and Ruler benchmarks, incurs significantly lower data-generation overhead than baselines, and integrates seamlessly with diverse long-dependency enhancement techniques.
📝 Abstract
High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face computational-efficiency challenges. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on the HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language model training.
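The retrieval-and-stitching step described in the abstract can be sketched in a few lines. This is a hedged illustration, not the paper's released code: it uses a from-scratch BM25 scorer, whitespace tokenization as a stand-in for a real tokenizer, and the function names (`bm25_scores`, `build_long_sample`) are hypothetical. The 128K default mirrors the token budget stated in the abstract.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in the corpus against the query with BM25."""
    n = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n
    df = Counter()                      # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

def build_long_sample(topic, corpus, token_budget=128_000):
    """Retrieve topic-relevant docs via BM25 and stitch them up to the budget."""
    corpus_tokens = [doc.split() for doc in corpus]
    scores = bm25_scores(topic.split(), corpus_tokens)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        length = len(corpus_tokens[i])
        if used + length > token_budget:
            break                       # stop once the 128K-style budget is full
        picked.append(corpus[i])
        used += length
    return "\n\n".join(picked)
```

In practice the topic queries would come from the multi-LLM debate stage and the corpus would be a large pretraining document pool; the greedy fill above simply concatenates the highest-scoring documents until the token budget is exhausted.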