Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing search-augmented reasoning methods often rely on elaborate training machinery and external supervision, which inflates the training pipeline and ties it to resources that may not be available. This work proposes Search-E1, a streamlined approach that drops external supervision and auxiliary modules and uses only agent-generated data for self-evolution. Search-E1 alternates between GRPO reinforcement learning and offline self-distillation (OFSD) driven by a token-level forward KL objective, which provides dense, step-level supervision. With Qwen2.5-3B as the base model, Search-E1 reaches an average Exact Match (EM) score of 0.440 across seven question-answering benchmarks, surpassing the open-source baselines it is compared against.
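As a rough illustration of the GRPO half of this loop, the sketch below computes group-relative advantages the way GRPO is commonly described: rewards for several rollouts of the same question are normalized by the group mean and standard deviation. The function name, the `eps` constant, and the example rewards are assumptions made for illustration; the summary does not state the paper's exact reward or normalization settings.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same question.

    Hypothetical helper: follows the commonly described GRPO recipe,
    not a configuration confirmed by the paper.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance over the group of rollouts.
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one question, scored 1.0 if the answer matched exactly.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat the group average receive positive advantages and the rest receive negative ones, so no separate value model is needed to form the learning signal.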

📝 Abstract

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with offline self-distillation (OFSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches 0.440 average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. The code and the complete version will be made public soon.
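The token-level forward KL objective mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes "forward KL" in the usual distillation sense, KL(teacher || student), where the teacher is the policy's distribution under the privileged context and the student is the same policy's distribution under the ordinary inference-time context. All function names and the toy logits are hypothetical, and a real training loop would treat the teacher side as fixed (no gradient).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    # Sum over the vocabulary of p_teacher * (log p_teacher - log p_student).
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def sequence_forward_kl(teacher_seq, student_seq):
    """Return (mean KL, per-token KL list) over aligned token positions."""
    per_token = [token_forward_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token

# Toy example: two token positions over a three-word vocabulary.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # assumed: logits with privileged context
student = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]   # assumed: logits with plain context
loss, per_token = sequence_forward_kl(teacher, student)
```

Because the loss is defined at every token position rather than once per trajectory, each step of the rollout receives its own training signal, which is the sense in which the supervision is dense.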

Problem

Research questions and friction points this paper is trying to address.

search-augmented reasoning

post-training

self-distillation

language model

training pipeline complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation

search-augmented reasoning

GRPO