Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the sparse reward signals and training instability that arise from the weak reasoning capabilities of compact language models (e.g., 0.5B parameters) in agentic retrieval-augmented generation (RAG), this paper proposes Distillation-Guided Policy Optimization (DGPO). DGPO combines behavior cloning for cold-start initialization, knowledge distillation from a stronger teacher model, and reinforcement learning-based policy optimization, and is paired with Agentic RAG Capabilities (ARC), a fine-grained evaluation metric that separately quantifies reasoning, search coordination, and response synthesis. By leveraging continuous behavioral guidance from the teacher, DGPO mitigates the exploration difficulties inherent to small models and significantly improves training stability and convergence efficiency. Experiments demonstrate that the 0.5B model reproduces complex, multi-step search behaviors across diverse agentic search tasks, surpassing the larger teacher model on several metrics. This work establishes a paradigm for deploying lightweight models in resource-constrained RAG applications.

📝 Abstract
Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.
Problem

Research questions and friction points this paper is trying to address.

Enabling compact language models for agentic search behaviors
Overcoming sparse rewards and unstable training in small models
Achieving agentic RAG capabilities in resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distillation-Guided Policy Optimization for agentic RAG
Cold-start initialization from teacher demonstrations
Continuous teacher guidance during policy optimization
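
The continuous teacher guidance described above can be illustrated with a toy per-step objective that mixes a policy-gradient term with a distillation term. This is a minimal sketch under stated assumptions: the mixing weight `alpha`, the function names, and the REINFORCE-style policy-gradient term are illustrative choices, not the paper's exact formulation.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dgpo_loss(student_logits, teacher_probs, action, reward, alpha=0.5):
    """Toy per-step objective combining:
    - a REINFORCE-style term: -reward * log pi_student(action)
    - a distillation term: KL(teacher || student) over the action vocabulary.
    `alpha` trades off teacher guidance against reward-driven learning
    (hypothetical weighting, for illustration only).
    """
    p = softmax(student_logits)
    policy_grad = -reward * math.log(p[action])
    kl = sum(t * math.log(t / s) for t, s in zip(teacher_probs, p) if t > 0)
    return (1 - alpha) * policy_grad + alpha * kl
```

With a uniform student and a peaked teacher, the KL term dominates early training (pulling the student toward teacher behavior), while a confident student with high-reward rollouts is driven mainly by the policy-gradient term.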