Beyond Outcome Reward: Decoupling Search and Answering Improves LLM Agents

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit suboptimal search behaviors in search-augmented question answering—such as failing to invoke the search tool, issuing ineffective queries, and performing redundant searches—primarily because training relies solely on final-answer rewards (e.g., exact match).
Method: The Decoupling Search-and-Answering (DeSA) framework explicitly separates search policy optimization from answer generation via a two-stage reinforcement learning pipeline: Stage 1 optimizes search effectiveness using retrieval recall as the reward; Stage 2 optimizes answer generation using answer quality as the reward. This decoupling mitigates the weak search signal inherent in end-to-end joint training.
Contribution/Results: Extensive experiments across seven open-domain QA benchmarks demonstrate that DeSA significantly improves both search recall (+12.3% on average) and final answer accuracy (+8.7% on average), consistently outperforming single-stage baselines.

📝 Abstract
Enabling large language models (LLMs) to utilize search tools offers a promising path to overcoming fundamental limitations such as knowledge cutoffs and hallucinations. Recent work has explored reinforcement learning (RL) for training search-augmented agents that interleave reasoning and retrieval before answering. These approaches usually rely on outcome-based rewards (e.g., exact match), implicitly assuming that optimizing for final answers will also yield effective intermediate search behaviors. Our analysis challenges this assumption: we uncover multiple systematic deficiencies in search that arise under outcome-only training and ultimately degrade final answer quality, including failure to invoke tools, invalid queries, and redundant searches. To address these shortcomings, we introduce DeSA (Decoupling Search-and-Answering), a simple two-stage training framework that explicitly separates search optimization from answer generation. In Stage 1, agents are trained to improve search effectiveness with retrieval recall-based rewards. In Stage 2, outcome rewards are employed to optimize final answer generation. Across seven QA benchmarks, DeSA-trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome-only baselines. Notably, DeSA outperforms single-stage training approaches that simultaneously optimize recall and outcome rewards, underscoring the necessity of explicitly decoupling the two objectives.
Problem

Research questions and friction points this paper is trying to address.

Addresses deficiencies of outcome-only training in search-augmented LLMs
Improves search behaviors such as tool invocation and query validity
Decouples search optimization from answer generation to improve answer accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples search and answering training stages
Uses retrieval recall rewards for search optimization
Employs outcome rewards for answer generation
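The two-stage reward scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the SQuAD-style normalization, the recall definition (fraction of gold answers covered by retrieved passages), and the exact-match check are assumptions about how such rewards are commonly computed.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def recall_reward(retrieved_docs: list[str], gold_answers: list[str]) -> float:
    """Stage 1 reward (assumed form): fraction of gold answers that
    appear somewhere in the concatenated retrieved passages."""
    corpus = normalize(" ".join(retrieved_docs))
    hits = sum(1 for ans in gold_answers if normalize(ans) in corpus)
    return hits / len(gold_answers) if gold_answers else 0.0


def outcome_reward(prediction: str, gold_answers: list[str]) -> float:
    """Stage 2 reward (assumed form): exact match against any gold answer."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(ans) for ans in gold_answers) else 0.0


# Example usage:
docs = ["The Eiffel Tower is located in Paris, France."]
print(recall_reward(docs, ["Paris"]))      # → 1.0
print(outcome_reward("paris", ["Paris"]))  # → 1.0
```

In Stage 1 the agent's search policy would be trained against `recall_reward` alone, so the learning signal depends only on what the queries retrieve; Stage 2 then switches to `outcome_reward` to optimize the final answer.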
Yiding Wang
Department of Computer Science, University of Virginia
Zhepei Wei
University of Virginia
Machine Learning · Natural Language Processing · Large Language Models
Xinyu Zhu
Department of Computer Science, University of Virginia
Yu Meng
Department of Computer Science, University of Virginia