🤖 AI Summary
Existing search-engine-based tool-integrated reasoning (TIR) agents rely on reinforcement learning but suffer from sparse rewards, inefficient exploration, and training instability on complex multi-hop question answering. To address these challenges, we propose CriticSearch, a framework built around a retrospective critic that generates dense, turn-level feedback by leveraging complete reasoning trajectories and ground-truth answers, enabling fine-grained credit assignment. The critic is a frozen, asymmetric large language model that delivers stable evaluation signals, jointly optimizing tool invocation and multi-hop retrieval strategies. On mainstream multi-hop reasoning benchmarks, our approach significantly outperforms strong baselines: it accelerates convergence by 32%, improves final performance by 11.4%, and reduces training variance by 47%, achieving superior efficiency, training stability, and generalization across diverse reasoning tasks.
📝 Abstract
Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search-agent pipelines typically depend on reinforcement-learning-based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critic LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.
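The retrospective credit-assignment idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Turn` dataclass, the `retrospective_rewards` helper, the blending weight, and the `toy_critic` stand-in for the frozen critic LLM are all hypothetical names and choices introduced here for clarity. The key point it shows is that the critic scores each turn with privileged access to the full trajectory and the gold answer, and those per-turn scores are blended with the sparse episode outcome into dense rewards.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    query: str        # tool call (search query) issued at this turn
    observation: str  # evidence returned by the search engine

def retrospective_rewards(
    turns: List[Turn],
    gold_answer: str,
    critic: Callable[[Turn, List[Turn], str], float],
    outcome_reward: float,
    dense_weight: float = 0.5,
) -> List[float]:
    """Score each turn retrospectively, using privileged information
    (the complete trajectory and the gold answer), then blend each
    turn-level score with the sparse episode-level outcome reward."""
    dense = [critic(t, turns, gold_answer) for t in turns]
    return [outcome_reward + dense_weight * d for d in dense]

# Toy stand-in for the frozen critic LLM (hypothetical heuristic):
# credit turns whose retrieved evidence mentions the gold answer.
def toy_critic(turn: Turn, trajectory: List[Turn], gold: str) -> float:
    return 1.0 if gold.lower() in turn.observation.lower() else 0.0

traj = [
    Turn("capital of France?", "Paris is the capital of France."),
    Turn("population of Paris?", "About 2.1 million people."),
]
# Episode answered correctly -> outcome reward 1.0, densified per turn.
rewards = retrospective_rewards(traj, "Paris", toy_critic, outcome_reward=1.0)
```

Here the first turn receives extra credit because its observation contains the gold answer, while the second still inherits the episode outcome, so every turn gets a non-sparse learning signal.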