🤖 AI Summary
To address performance bottlenecks in training long-context large language models, which stem from data domain shift and semantic fragmentation, this paper proposes a query-centric data synthesis paradigm. Methodologically, it introduces the first query-prediction-driven document clustering mechanism, combining generative query prediction, keyword-query joint embedding matching, semantic similarity-based grouping, and dynamic synthesis to preserve semantic coherence while enhancing contextual diversity. The approach supports modeling sequences of up to one million tokens and scales across model sizes. Empirically, it substantially outperforms random-concatenation training (Standard), KNN-based retrieval grouping, and in-context pretraining (ICLM) baselines on multiple long-context benchmarks. Crucially, it provides the first empirical validation of both the effectiveness and the scalability of million-token-context modeling, demonstrating robust generalization across diverse long-context reasoning and retrieval tasks.
📝 Abstract
Recent advancements in large language models (LLMs) have highlighted the importance of extending context lengths for handling complex tasks. While traditional methods for training on long contexts often use filtered long documents, these approaches lead to domain imbalances, limiting model performance. To address this, techniques like random document concatenation (Standard) and similarity-based methods (KNN, ICLM) have been developed. However, they sacrifice either semantic coherence or diversity. To balance both aspects, we introduce Quest, a query-centric data synthesis method that aggregates semantically relevant yet diverse documents. Quest uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords. Extensive experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with context lengths of up to 1M tokens and confirming its scalability across various model sizes.
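To make the query-centric idea concrete, here is a minimal, hypothetical sketch of the pipeline the abstract describes: predict a query for each document, embed queries, and greedily group documents with similar queries into long-context training samples. It is not the paper's implementation; the query predictor is replaced by a toy keyword extractor, embeddings by bag-of-words vectors, and all names (`predict_query`, `group_documents`, the `threshold` and `max_tokens` parameters) are illustrative assumptions.

```python
# Toy sketch of a Quest-style query-centric grouping pipeline (not the paper's code).
from collections import Counter
import math

def predict_query(doc: str) -> str:
    # Stand-in for the generative query predictor: use the document's
    # most frequent words as a surrogate "query".
    words = [w.lower() for w in doc.split()]
    return " ".join(w for w, _ in Counter(words).most_common(3))

def embed(text: str) -> Counter:
    # Stand-in for the keyword-query joint embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_documents(docs, threshold=0.3, max_tokens=50):
    """Greedily place each document into the first group whose aggregate
    query vector is similar enough, capped at a target context length;
    concatenate each group into one long-context training sample."""
    groups = []  # each: {"query_vec": Counter, "docs": list[str], "tokens": int}
    for doc in docs:
        qv = embed(predict_query(doc))
        ntok = len(doc.split())  # crude token count for the sketch
        for g in groups:
            if g["tokens"] + ntok <= max_tokens and cosine(qv, g["query_vec"]) >= threshold:
                g["docs"].append(doc)
                g["query_vec"] += qv   # running centroid of group queries
                g["tokens"] += ntok
                break
        else:
            groups.append({"query_vec": qv, "docs": [doc], "tokens": ntok})
    return [" ".join(g["docs"]) for g in groups]

docs = [
    "cats cats like fish",
    "cats cats eat fish fish",
    "rust compiler borrow checker",
]
samples = group_documents(docs)
# The two cat/fish documents share a predicted query and merge into one
# sample; the unrelated document starts a new group.
```

The greedy grouping with a token cap mirrors the coherence/diversity trade-off Quest targets: documents are merged only when their predicted queries agree, so each synthesized sample stays topically coherent while still mixing distinct sources.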