🤖 AI Summary
In search-augmented question answering, large language models (LLMs) exhibit suboptimal search behaviors, such as under-invocation, invalid queries, and redundant searches, primarily because training relies solely on final-answer rewards (e.g., exact match).
Method: We propose DeSA (Decoupling Search-and-Answering), a framework that, for the first time, explicitly separates search policy optimization from answer generation via a two-stage reinforcement learning pipeline: Stage 1 optimizes search effectiveness using retrieval recall as the reward; Stage 2 optimizes answer generation using answer quality as the reward. This decoupling mitigates the weak search signal inherent in end-to-end joint training.
Contribution/Results: Extensive experiments across seven open-domain QA benchmarks demonstrate that DeSA significantly improves both search recall (+12.3% on average) and final answer accuracy (+8.7% on average), consistently outperforming single-stage baselines.
📝 Abstract
Enabling large language models (LLMs) to utilize search tools offers a promising path to overcoming fundamental limitations such as knowledge cutoffs and hallucinations. Recent work has explored reinforcement learning (RL) for training search-augmented agents that interleave reasoning and retrieval before answering. These approaches usually rely on outcome-based rewards (e.g., exact match), implicitly assuming that optimizing for final answers will also yield effective intermediate search behaviors. Our analysis challenges this assumption: we uncover multiple systematic deficiencies in search that arise under outcome-only training and ultimately degrade final answer quality, including failure to invoke tools, invalid queries, and redundant searches. To address these shortcomings, we introduce DeSA (Decoupling Search-and-Answering), a simple two-stage training framework that explicitly separates search optimization from answer generation. In Stage 1, agents are trained to improve search effectiveness with retrieval recall-based rewards. In Stage 2, outcome rewards are employed to optimize final answer generation. Across seven QA benchmarks, DeSA-trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome-only baselines. Notably, DeSA outperforms single-stage training approaches that simultaneously optimize recall and outcome rewards, underscoring the necessity of explicitly decoupling the two objectives.
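The two-stage reward design described above can be illustrated with a minimal sketch. The function names (`recall_reward`, `outcome_reward`, `normalize`) and the answer-normalization details are assumptions for illustration; the paper's exact reward formulation may differ.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the usual normalization used for exact-match QA scoring)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def recall_reward(retrieved_docs: list[str], gold_answers: list[str]) -> float:
    """Stage 1 reward (assumed form): fraction of gold answers that
    appear in at least one retrieved passage."""
    if not gold_answers:
        return 0.0
    hits = sum(
        any(normalize(ans) in normalize(doc) for doc in retrieved_docs)
        for ans in gold_answers
    )
    return hits / len(gold_answers)


def outcome_reward(prediction: str, gold_answers: list[str]) -> float:
    """Stage 2 reward: 1.0 if the final answer exactly matches any
    gold answer after normalization, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ans) for ans in gold_answers))
```

Under this sketch, Stage 1 credits the policy whenever its queries surface answer-bearing passages, regardless of the final answer, which is exactly the search signal that outcome-only training leaves implicit.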