Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address hallucination in retrieval-augmented generation (RAG) and reward hacking caused by insufficient explicit guidance during complex multi-step reasoning, this paper proposes Bi-RAR, a bidirectional retrieval-augmented reasoning framework. Methodologically, Bi-RAR integrates retrieval-augmented generation, search interaction mechanisms, and language model-based probabilistic approximation to compute information distance. Its core contributions are: (1) introducing a bidirectional information distance grounded in Kolmogorov complexity to quantify the information completeness of each reasoning step; and (2) designing a cascaded multi-objective reinforcement learning framework that jointly optimizes forward reasoning and backward verification. Evaluated on seven question-answering benchmarks, Bi-RAR significantly outperforms state-of-the-art methods, achieving substantial gains in both reasoning accuracy and robustness. Moreover, it enables efficient collaboration between large language models and search engines during both training and inference.
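The core quantity described above, an information distance grounded in Kolmogorov complexity and approximated by language model generation probabilities, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `logprob_fn`, the mixing weight `alpha`, and the exact combination of forward and backward terms are all assumptions for exposition. The standard approximation used here is that the conditional complexity K(y | x) is estimated by the negative log-probability the model assigns to y given x.

```python
def info_distance(logprob_fn, context: str, target: str) -> float:
    """Approximate the conditional Kolmogorov complexity K(target | context)
    by the negative log-probability a language model assigns to `target`
    given `context`. `logprob_fn(target, context)` is a hypothetical helper
    that returns the summed token log-probabilities."""
    return -logprob_fn(target, context)

def bidirectional_distance(logprob_fn, question: str, reasoning: str,
                           answer: str, alpha: float = 0.5) -> float:
    """Combine the two directions described in the paper: how far the
    current reasoning is from the answer (forward) and how well it
    addresses the question (backward). `alpha` is an assumed mixing
    weight, not a value taken from the paper."""
    forward = info_distance(logprob_fn, reasoning, answer)    # reasoning -> answer
    backward = info_distance(logprob_fn, reasoning, question)  # reasoning -> question
    return alpha * forward + (1 - alpha) * backward
```

In practice the log-probability would come from scoring the target continuation with the policy model itself, so the distance can be computed without extra reward models.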

📝 Abstract
Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. However, most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps, which often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in multi-step reasoning with retrieval-augmented generation
Addressing reward hacking from outcome-based supervision in iterative reasoning
Optimizing bidirectional step evaluation for improved information completeness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional evaluation of intermediate reasoning steps
Bidirectional information distance using generation probabilities
Multi-objective reinforcement learning with cascading rewards
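The cascading reward idea, earlier steps accumulating credit from everything downstream so that early trajectory alignment is emphasized, can be sketched as a discounted reward-to-go over per-step scores. This is an assumption-labeled illustration of the general technique, not the paper's actual reward function; the `decay` hyperparameter and the geometric weighting are choices made here for clarity.

```python
def cascaded_rewards(step_scores, decay: float = 0.8):
    """Turn per-step quality scores (e.g. bidirectional information
    distances converted to scores) into cascaded rewards: each step's
    reward sums its own score plus geometrically decayed scores of all
    later steps. Early steps therefore receive the largest cumulative
    signal. `decay` is an assumed hyperparameter."""
    T = len(step_scores)
    rewards = []
    for t in range(T):
        # Reward-to-go from step t, discounted by distance from t.
        r = sum((decay ** (k - t)) * step_scores[k] for k in range(t, T))
        rewards.append(r)
    return rewards
```

With uniform step scores, the first step's reward strictly dominates later ones, which is one simple way to realize "emphasizes early trajectory alignment" inside a standard policy-gradient objective.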