PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in open-domain multi-hop question answering: retrieval collapse caused by insufficient reasoning guidance, and weak credit assignment in reinforcement learning, which hinders error localization across modules. To this end, the authors propose PRISMA, a Plan-Retrieve-Inspect-Solve-Memoize framework that leverages multi-agent collaboration to enable reasoning-guided retrieval and verification. The framework introduces an Inspector module that provides feedback to the Planner to refine question decomposition and retrieval strategies. It further employs a two-stage policy optimization mechanism: the first stage trains the Planner and Solver independently, while the second stage enhances the Inspector's contextual validation and recovery capabilities through Observation-Aware Residual Policy Optimization (OARPO). Evaluated on ten benchmark datasets, PRISMA achieves state-of-the-art performance with high stability and strong cross-task generalization, making it suitable for efficient deployment in real-world scenarios.
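The Plan-Retrieve-Inspect-Solve-Memoize loop described above, with the Inspector feeding criticism back to the Planner, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the retry budget, and the fallback behavior are all assumptions.

```python
def prisma_answer(question, plan, retrieve, inspect, solve, memo, max_rounds=3):
    """Hypothetical sketch of the PRISMA agent loop: the Planner decomposes
    the question, the Retriever fetches evidence per sub-question, and the
    Inspector verifies the evidence, feeding criticism back to the Planner."""
    feedback = None
    evidence = []
    for _ in range(max_rounds):
        sub_questions = plan(question, feedback)           # Plan (refined by feedback)
        evidence = [retrieve(sq) for sq in sub_questions]  # Retrieve
        ok, feedback = inspect(question, evidence)         # Inspect
        if ok:
            answer = solve(question, evidence)             # Solve on verified evidence
            memo(question, evidence, answer)               # Memoize for reuse
            return answer
    return solve(question, evidence)  # assumed fallback after exhausting retries
```

Each role here would be played by a separate policy; the paper's contribution is in how those policies are trained and coordinated, not in the outer loop itself.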

📝 Abstract
Answering real-world open-domain multi-hop questions over massive corpora is a critical challenge in Retrieval-Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to optimize the retrieval-augmented reasoning process end-to-end, directly enhancing its capacity to resolve complex queries. However, reliable deployment is hindered by two obstacles. 1) Retrieval Collapse: iterative retrieval over large corpora fails to locate intermediate evidence containing bridge answers without reasoning-guided planning, causing downstream reasoning to collapse. 2) Learning Instability: end-to-end trajectory training suffers from weak credit assignment across reasoning chains and poor error localization across modules, causing overfitting to benchmark-specific heuristics that limits transferability and stability. To address these problems, we propose PRISMA, a decoupled RL-guided framework featuring a Plan-Retrieve-Inspect-Solve-Memoize architecture. PRISMA's strength lies in reasoning-guided collaboration: the Inspector provides reasoning-based feedback to refine the Planner's decomposition and fine-grained retrieval, while enforcing evidence-grounded reasoning in the Solver. We optimize individual agent capabilities via Two-Stage Group Relative Policy Optimization (GRPO). Stage I calibrates the Planner and Solver as specialized experts in planning and reasoning, while Stage II utilizes Observation-Aware Residual Policy Optimization (OARPO) to enhance the Inspector's ability to verify context and trigger targeted recovery. Experiments show that PRISMA achieves state-of-the-art performance on ten benchmarks and can be deployed efficiently in real-world scenarios.
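The training scheme builds on GRPO, whose core idea is to score each sampled rollout against the other rollouts in its own group rather than against a learned value baseline. A minimal sketch of that group-relative advantage computation is below; the reward values and the zero-std guard are illustrative assumptions, and OARPO's residual modifications are paper-specific and not reproduced here.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the style of GRPO: each trajectory's
    reward is normalized by the mean and std of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a constant group
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one question, scored by an exact-match reward.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct rollouts get positive advantage, incorrect ones negative,
# and the group's advantages sum to zero.
```

Because the baseline comes from the group itself, no critic network is needed; this is what makes per-agent, stage-wise training of the Planner and Solver tractable.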
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
Multi-Hop Question Answering
Retrieval Collapse
Learning Instability
Reinforcement Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Multi-Hop Question Answering
Retrieval-Augmented Generation
Two-Stage Policy Optimization
Reasoning-Guided Retrieval