🤖 AI Summary
To address performance bottlenecks in training long-context large language models, which stem from data domain shift and semantic fragmentation, this paper proposes a query-centric data synthesis paradigm. Methodologically, it introduces the first query-prediction-driven document clustering mechanism, combining generative query prediction, keyword-query joint embedding matching, semantic similarity-based grouping, and dynamic synthesis to preserve semantic coherence while enhancing contextual diversity. The approach supports modeling sequences of up to one million tokens and scales across model sizes. Empirically, it substantially outperforms random-concatenation training (Standard), KNN-based retrieval grouping, and in-context pretraining (ICLM) baselines on multiple long-context benchmarks. Crucially, it provides the first empirical validation of both the effectiveness and the scalability of million-token-context modeling, demonstrating robust generalization across diverse long-context reasoning and retrieval tasks.
📝 Abstract
Recent advancements in large language models (LLMs) have highlighted the importance of extending context lengths for handling complex tasks. While traditional methods for training on long contexts often use filtered long documents, these approaches lead to domain imbalances, limiting model performance. To address this, techniques like random document concatenation (Standard) and similarity-based methods (KNN, ICLM) have been developed. However, they sacrifice either semantic coherence or diversity. To balance both aspects, we introduce Quest, a query-centric data synthesis method that aggregates semantically relevant yet diverse documents. Quest uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords. Extensive experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with context lengths of up to 1M tokens and confirming its scalability across various model sizes.
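To make the query-centric idea concrete, here is a minimal, hypothetical sketch of the pipeline the abstract describes: predict a query for each document, embed queries, and greedily group documents with similar queries into long-context training samples. It is not the paper's implementation; the query predictor is replaced by a toy keyword extractor, embeddings by bag-of-words vectors, and all names (`predict_query`, `group_documents`, the `threshold` and `max_tokens` parameters) are illustrative assumptions.

```python
# Toy sketch of a Quest-style query-centric grouping pipeline (not the paper's code).
from collections import Counter
import math

def predict_query(doc: str) -> str:
    # Stand-in for the generative query predictor: use the document's
    # most frequent words as a surrogate "query".
    words = [w.lower() for w in doc.split()]
    return " ".join(w for w, _ in Counter(words).most_common(3))

def embed(text: str) -> Counter:
    # Stand-in for the keyword-query joint embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_documents(docs, threshold=0.3, max_tokens=50):
    """Greedily place each document into the first group whose aggregate
    query vector is similar enough, capped at a target context length;
    concatenate each group into one long-context training sample."""
    groups = []  # each: {"query_vec": Counter, "docs": list[str], "tokens": int}
    for doc in docs:
        qv = embed(predict_query(doc))
        ntok = len(doc.split())  # crude token count for the sketch
        for g in groups:
            if g["tokens"] + ntok <= max_tokens and cosine(qv, g["query_vec"]) >= threshold:
                g["docs"].append(doc)
                g["query_vec"] += qv   # running centroid of group queries
                g["tokens"] += ntok
                break
        else:
            groups.append({"query_vec": qv, "docs": [doc], "tokens": ntok})
    return [" ".join(g["docs"]) for g in groups]

docs = [
    "cats cats like fish",
    "cats cats eat fish fish",
    "rust compiler borrow checker",
]
samples = group_documents(docs)
# The two cat/fish documents share a predicted query and merge into one
# sample; the unrelated document starts a new group.
```

The greedy grouping with a token cap mirrors the coherence/diversity trade-off Quest targets: documents are merged only when their predicted queries agree, so each synthesized sample stays topically coherent while still mixing distinct sources.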