🤖 AI Summary
Reinforcement learning (RL) for retrieval-augmented generation (RAG) in multi-hop question answering suffers from two key bottlenecks: the absence of global reasoning planning and unfaithful execution. Method: We propose a collaborative RL framework that jointly optimizes retrieval and reasoning. It structures multi-step inference via subgoal decomposition, integrates iterative evidence refinement with coherent planning, and introduces a dual-granularity reward mechanism (a planning-quality reward and a subgoal-completion reward) alongside progressive weight annealing to balance process consistency against final-answer accuracy. Contribution/Results: Our method significantly outperforms strong baselines on both in-domain and cross-domain benchmarks. Notably, it achieves average improvements of 14.2% in Exact Match (EM) and F1 while using only 42% of the training data required by strong baselines, empirically validating the effectiveness of co-optimizing global reasoning and faithful execution.
📝 Abstract
Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains constrained by two fundamental limitations: (i) the absence of global planning to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and the consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce a Planning Quality Reward and a SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training examples (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
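To make the reward design concrete, the sketch below shows one plausible reading of the dual-granularity reward with progressive weight annealing: a process reward (planning quality plus subgoal completion) is blended with an outcome reward (final-answer correctness), and the blend shifts toward the outcome as training progresses. The linear schedule, the equal process-reward mix, and all function and variable names are illustrative assumptions, not the paper's exact formulation.

```python
def annealed_reward(
    r_plan: float,      # planning-quality reward in [0, 1] (assumed range)
    r_subgoal: float,   # subgoal-completion reward in [0, 1] (assumed range)
    r_outcome: float,   # outcome reward, e.g. answer EM in {0, 1}
    step: int,          # current training step
    total_steps: int,   # total training steps
) -> float:
    """Blend process-oriented and outcome-based rewards, annealing
    the weight from process toward outcome over training."""
    # Assumed linear schedule: outcome weight grows from 0 to 1.
    alpha = min(step / total_steps, 1.0)
    # Assumed equal mix of the two process-level rewards.
    r_process = 0.5 * r_plan + 0.5 * r_subgoal
    return (1.0 - alpha) * r_process + alpha * r_outcome


# Early in training the process reward dominates; late in training
# the outcome reward dominates.
print(annealed_reward(0.8, 0.6, 1.0, step=100, total_steps=10_000))    # ~0.70
print(annealed_reward(0.8, 0.6, 1.0, step=9_000, total_steps=10_000))  # ~0.97
```

The point of such a schedule is to give dense process-level feedback while planning behavior is still forming, then hand optimization pressure over to the sparse final-answer signal once subgoal execution has stabilized.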