🤖 AI Summary
This work addresses a limitation of existing retrieval-augmented reasoning methods: they often neglect dependencies among sub-skills and rely heavily on distillation from a stronger model. In multi-hop retrieval, partially relevant documents retrieved early can pull later searches off course. To mitigate this, the authors study a structured planning mechanism (Plan) that decomposes the original question into an ordered sequence of sub-questions before any retrieval is performed, so that each retrieval step targets a pre-defined sub-question. They further introduce a self-bootstrapping framework that requires no distillation from an external stronger model: a small-scale seed model generates filtered reasoning trajectories that activate Plan in the target model. The study also finds that an identical reward signal induces qualitatively different reinforcement learning failure modes across three model families spanning 3B to 14B parameters. The pipeline activates Plan in every tested model and consistently outperforms competitive baselines on multi-hop question answering benchmarks.
📝 Abstract
Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. Across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.
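To make the Plan behavior described above concrete, here is a minimal Python sketch of plan-then-retrieve control flow. It is an illustration under stated assumptions, not the authors' implementation: `plan_then_retrieve`, `toy_planner`, `toy_retriever`, `CORPUS`, and the example question are all hypothetical, the planner is a hard-coded stand-in for a model's output, and the retriever is simple keyword overlap over an invented three-sentence corpus. The sketch also simplifies by using each sub-question verbatim as the search query.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Invented toy corpus, for illustration only.
CORPUS = [
    "The film Alpha was directed by Dana Reed.",
    "Dana Reed was born in Lisbon.",
    "Lisbon is the capital of Portugal.",
]
STOPWORDS = {"who", "the", "was", "of", "where", "is", "in", "by"}


@dataclass
class Step:
    sub_question: str
    documents: List[str] = field(default_factory=list)


def tokens(text: str) -> set:
    """Lowercased words with trailing punctuation and stopwords removed."""
    return {w.strip("?.,").lower() for w in text.split()} - STOPWORDS


def toy_planner(question: str) -> List[str]:
    # Hypothetical stand-in for the model's planning output.
    return [
        "Who directed the film Alpha?",
        "Where was the director of the film Alpha born?",
    ]


def toy_retriever(query: str) -> List[str]:
    # Return every document sharing at least one content word with the query.
    terms = tokens(query)
    return [doc for doc in CORPUS if terms & tokens(doc)]


def plan_then_retrieve(
    question: str,
    planner: Callable[[str], List[str]],
    retriever: Callable[[str], List[str]],
) -> List[Step]:
    """Fix the ordered plan first, then issue one search per sub-question."""
    # The plan is produced before any retrieval, so documents returned by
    # earlier searches cannot change which queries are issued later.
    plan = planner(question)
    return [Step(sq, retriever(sq)) for sq in plan]


steps = plan_then_retrieve(
    "Where was the director of the film Alpha born?", toy_planner, toy_retriever
)
for step in steps:
    print(step.sub_question, "->", step.documents)
```

The point of the sketch is the ordering: every query comes from the plan, never from previously retrieved text, which is the anchoring property the abstract attributes to Plan. A real agent would presumably also condition later steps on answers to earlier sub-questions, which this toy version omits.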