RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis

📅 2026-01-13

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of high-quality training data for agentic retrieval-augmented generation (RAG), which limits an agent's ability to plan autonomously and recover from errors in complex, noisy real-world retrieval environments. The authors propose RAGShaper, a data synthesis framework that automates the construction of RAG tasks and agent trajectories. Its InfoCurator component builds dense information trees enriched with adversarial distractors at the Perception and Cognition levels, and a constrained navigation strategy forces a teacher agent to confront those distractors, yielding trajectories that explicitly demonstrate error correction and noise rejection. In the reported experiments, models trained on the synthesized data significantly outperform existing baselines on noise-intensive and complex retrieval tasks.

📝 Abstract

Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.
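The idea of an information tree with distractors and a constrained walk over it can be illustrated with a small sketch. This is not the paper's implementation: the `InfoNode` class, the `constrained_trajectory` function, the "accept"/"reject" step labels, and the rule of visiting distractor siblings first are all assumptions made for illustration, and the snippet texts are placeholders. The abstract only states that distractors span Perception and Cognition levels and that the teacher agent is forced to confront them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InfoNode:
    """One snippet in a hypothetical information tree (illustrative only)."""
    text: str
    # None marks genuine evidence; otherwise an assumed level tag
    # such as "perception" or "cognition".
    distractor_level: Optional[str] = None
    children: List["InfoNode"] = field(default_factory=list)

    @property
    def is_distractor(self) -> bool:
        return self.distractor_level is not None


def constrained_trajectory(root: InfoNode) -> List[Tuple[str, str]]:
    """Depth-first walk that visits distractor siblings before evidence.

    Assumed constraint: the walker may not skip a distractor, so every
    distractor appears in the trajectory as an explicit "reject" step.
    """
    steps: List[Tuple[str, str]] = []

    def visit(node: InfoNode) -> None:
        if node.is_distractor:
            steps.append(("reject", node.text))
            return  # nothing below a rejected node is expanded
        steps.append(("accept", node.text))
        # sorted() is stable, so distractors (key False) come first and
        # siblings of the same kind keep their original order.
        for child in sorted(node.children, key=lambda c: not c.is_distractor):
            visit(child)

    visit(root)
    return steps


root = InfoNode("seed snippet", children=[
    InfoNode("evidence snippet"),
    InfoNode("distractor snippet 1", distractor_level="perception"),
    InfoNode("distractor snippet 2", distractor_level="cognition"),
])
trajectory = constrained_trajectory(root)
for action, text in trajectory:
    print(action, "|", text)
```

In this toy setup the walker reaches the evidence only after both distractors have been examined and discarded, which is one simple way to make noise rejection visible in a recorded trajectory.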

Problem

Research questions and friction points this paper is trying to address.

Agentic RAG

training data scarcity

retrieval noise

dynamic reasoning

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic RAG

Data Synthesis

Adversarial Distractors