🤖 AI Summary
To address the sparse reward signals and training instability arising from the weak reasoning capabilities of compact language models (e.g., 0.5B parameters) in agent-based retrieval-augmented generation (RAG), this paper proposes Distillation-Guided Policy Optimization (DGPO). DGPO integrates behavior cloning for cold-start initialization, knowledge distillation from a stronger teacher model, and reinforcement learning-based policy optimization; the paper further introduces Agentic RAG Capabilities (ARC), a fine-grained evaluation metric that separately quantifies reasoning, search coordination, and response synthesis. By providing continuous behavioral guidance from the teacher model throughout training, DGPO mitigates the exploration difficulties inherent to small models and substantially improves training stability and convergence efficiency. Experiments show that the 0.5B model reproduces complex, multi-step search behaviors across diverse agentic search tasks, in some cases surpassing the larger teacher model. This work offers a practical path for deploying lightweight models in resource-constrained RAG applications.
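The summary does not give DGPO's exact objective, but the described combination of a sparse-reward policy-gradient signal with continuous teacher guidance can be sketched as an RL surrogate loss plus a KL distillation penalty toward the teacher's token distribution. Everything below (the function name `dgpo_loss`, the weighting `beta`, and the toy shapes) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dgpo_loss(student_logits, teacher_logits, actions, rewards, beta=0.1):
    """Toy DGPO-style objective (assumed form, not the paper's exact loss).

    student_logits, teacher_logits: (T, V) per-step vocabulary logits
    actions: (T,) sampled token ids; rewards: (T,) per-step returns
    beta: weight on the distillation term supplying dense teacher
          guidance even when the task reward is sparse.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # REINFORCE-style surrogate: -log pi(a_t) * R_t, averaged over steps
    log_pi = np.log(p_s[np.arange(len(actions)), actions] + 1e-12)
    rl_term = -(log_pi * rewards).mean()
    # Distillation: KL(teacher || student), averaged over steps
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return rl_term + beta * kl
```

When the student already matches the teacher, the KL term vanishes and the loss reduces to the pure RL surrogate; when rewards are mostly zero, the distillation term still provides a non-degenerate gradient, which is the intuition behind using teacher guidance to stabilize small-model training.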
📝 Abstract
Reinforcement learning has emerged as a post-training approach for eliciting agentic RAG behaviors, such as search and planning, from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in resource-constrained computing environments.