🤖 AI Summary
Existing RAG methods cover only a narrow range of scenarios and offer limited task diversity, largely because no general RAG instruction dataset exists. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse, high-quality RAG instruction data from any source corpus. The method combines five RAG paradigms, which cover diverse query-document relationships, with instruction simulation, which draws on existing instruction datasets to improve instruction diversity and quality. Applied to Wikipedia, it yields a 40K-sample instruction dataset spanning diverse RAG scenarios and tasks. Experiments show that RAG-Instruct improves LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available.
📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only a limited range of RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
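The two ingredients named in the abstract, a sampled RAG paradigm and an exemplar drawn from an existing instruction dataset, could be combined into a single synthesis prompt roughly as follows. This is a minimal sketch under stated assumptions, not the released implementation: the paradigm labels, the `SynthesisRequest` type, the helper names, and the prompt wording are all hypothetical, since the abstract does not name the paradigms or specify the prompt.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholder labels. The abstract says only that the five
# paradigms cover different query-document relationships; it does not name them.
PARADIGMS = [f"paradigm_{i}" for i in range(1, 6)]


@dataclass
class SynthesisRequest:
    """Everything needed to prompt an LLM for one synthetic RAG sample."""
    paradigm: str         # which query-document relationship to realize
    documents: list[str]  # source passages sampled from the corpus
    exemplar: str         # instruction taken from an existing dataset


def build_request(corpus, exemplars, rng, n_docs=2):
    """Sample a paradigm, distinct source documents, and an exemplar instruction."""
    return SynthesisRequest(
        paradigm=rng.choice(PARADIGMS),
        documents=rng.sample(corpus, n_docs),
        exemplar=rng.choice(exemplars),
    )


def render_prompt(req):
    """Format the request as a text prompt (the wording is an assumption)."""
    docs = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(req.documents))
    return (
        f"Query-document relationship: {req.paradigm}\n"
        f"Source documents:\n{docs}\n"
        f"Exemplar instruction (imitate its style and task type): {req.exemplar}\n"
        "Write a new instruction and a response grounded in the documents."
    )
```

Sampling the paradigm and the exemplar independently is one simple way to vary scenario and task type separately; the actual sampling strategy may differ.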