🤖 AI Summary
To address the high cost of synthesizing high-quality data for long-context LLM training and the poor computational efficiency of existing relevance-aggregation methods, this paper proposes an efficient long-context data synthesis framework. Methodologically, it builds a hierarchical topic architecture grounded in the BISAC book classification system and employs a multi-LLM agent debate mechanism to generate semantically coherent, viewpoint-diverse topics within that structure; lightweight BM25 retrieval coupled with 128K-token document stitching then enables low-cost context scaling. The key contribution is the tight integration of structured knowledge organization with collaborative topic generation, balancing diversity, consistency, and scalability. The framework achieves competitive performance on the HELMET and Ruler benchmarks, incurs significantly lower data-generation overhead than baselines, and integrates seamlessly with diverse long-dependency enhancement techniques.
📝 Abstract
High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face computational-efficiency challenges. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on the HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language model training.
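The retrieval-and-stitching step described in the abstract can be sketched in a few lines. This is a hedged illustration, not the paper's released code: it uses a from-scratch BM25 scorer, whitespace tokenization as a stand-in for a real tokenizer, and the function names (`bm25_scores`, `build_long_sample`) are hypothetical. The 128K default mirrors the token budget stated in the abstract.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in the corpus against the query with BM25."""
    n = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n
    df = Counter()                      # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

def build_long_sample(topic, corpus, token_budget=128_000):
    """Retrieve topic-relevant docs via BM25 and stitch them up to the budget."""
    corpus_tokens = [doc.split() for doc in corpus]
    scores = bm25_scores(topic.split(), corpus_tokens)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        length = len(corpus_tokens[i])
        if used + length > token_budget:
            break                       # stop once the 128K-style budget is full
        picked.append(corpus[i])
        used += length
    return "\n\n".join(picked)
```

In practice the topic queries would come from the multi-LLM debate stage and the corpus would be a large pretraining document pool; the greedy fill above simply concatenates the highest-scoring documents until the token budget is exhausted.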