🤖 AI Summary
Large language models (LLMs) exhibit weak factual grounding and limited logical coherence on knowledge-intensive complex reasoning tasks, e.g., commonsense and medical reasoning.
Method: We propose RARE (Retrieval-Augmented Reasoning Enhancement), a framework extending rStar that integrates two retrieval-augmented actions (A6/A7) into Monte Carlo Tree Search (MCTS), replaces the conventional discriminator with a Retrieval-Augmented Factuality Scorer for fact-prioritized path selection, and incorporates dynamic sub-question retrieval, context-aware re-answering, and multi-hop fact verification.
Contribution/Results: Evaluated with LLaMA 3.1, RARE significantly improves both logical coherence and factual accuracy. On multiple benchmarks, including CommonsenseQA, MedQA, and StrategyQA, it brings open-source models to performance competitive with GPT-4 and GPT-4o. To our knowledge, RARE is the first MCTS-based framework to enable fact-aware reasoning, establishing a new paradigm for factual, search-driven inference in LLMs.
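The two retrieval-augmented actions can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the word-overlap retriever and the function names (`retrieve`, `action_a6`, `action_a7`) are assumptions standing in for RARE's actual retriever and LLM-driven query generation.

```python
# Toy sketch of RARE's retrieval-augmented MCTS actions (assumed shapes,
# not the paper's code). A6 retrieves for the full problem statement;
# A7 retrieves per generated sub-question and re-answers each one.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def action_a6(problem: str, corpus: list[str]) -> dict:
    """A6: derive a query from the initial problem, retrieve, and
    augment the answering context with the retrieved passages."""
    evidence = retrieve(problem, corpus)  # query = problem statement
    return {"action": "A6", "evidence": evidence,
            "context": problem + " | " + " ".join(evidence)}

def action_a7(sub_questions: list[str], corpus: list[str]) -> list[dict]:
    """A7: retrieve for each sub-question and re-answer it with its
    own retrieved context."""
    return [{"action": "A7", "sub_question": sq,
             "evidence": retrieve(sq, corpus, k=1)}
            for sq in sub_questions]

corpus = [
    "Aspirin inhibits platelet aggregation.",
    "Paris is the capital of France.",
    "Monte Carlo Tree Search balances exploration and exploitation.",
]
step = action_a6("Which drug inhibits platelet aggregation?", corpus)
subs = action_a7(["Which city is the capital of France?"], corpus)
```

In the full method these actions are expanded as nodes in the MCTS tree alongside rStar's original five actions, so retrieval can occur at any depth of the reasoning trajectory.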
📝 Abstract
This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a versatile extension to the mutual reasoning framework rStar, aimed at enhancing reasoning accuracy and factual integrity of large language models (LLMs) on complex, knowledge-intensive tasks such as commonsense and medical reasoning. RARE incorporates two novel actions within the Monte Carlo Tree Search (MCTS) framework: A6, which generates search queries from the initial problem statement, performs information retrieval with those queries, and augments reasoning with the retrieved data to formulate the final answer; and A7, which applies information retrieval specifically to the generated sub-questions and re-answers them with the relevant retrieved context. Additionally, a Retrieval-Augmented Factuality Scorer is proposed to replace the original discriminator, prioritizing reasoning paths that meet high standards of factuality. Experimental results with LLaMA 3.1 show that RARE enables open-source LLMs to achieve performance competitive with top proprietary models like GPT-4 and GPT-4o. This research establishes RARE as a scalable solution for improving LLMs in domains where logical coherence and factual integrity are critical.
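The fact-prioritized path selection can be illustrated with a minimal sketch. The overlap-based `support` proxy below is an assumption for demonstration; in RARE the per-statement verification is performed by a retrieval-augmented LLM judge, not lexical matching.

```python
# Hedged sketch of a Retrieval-Augmented Factuality Scorer: score each
# candidate reasoning path by how well its statements are supported by
# retrieved evidence, then keep the best-supported path. Word overlap
# is a toy stand-in for the paper's LLM-based factuality judgment.

def support(statement: str, evidence: list[str]) -> float:
    """Fraction of statement words appearing in the evidence (toy proxy)."""
    words = set(statement.lower().split())
    ev_words = set(" ".join(evidence).lower().split())
    return len(words & ev_words) / max(len(words), 1)

def factuality_score(path: list[str], evidence: list[str]) -> float:
    """Average per-statement support across the reasoning path."""
    return sum(support(s, evidence) for s in path) / max(len(path), 1)

def select_path(paths: list[list[str]], evidence: list[str]) -> list[str]:
    """Fact-prioritized selection: the highest-scoring path wins."""
    return max(paths, key=lambda p: factuality_score(p, evidence))

evidence = ["aspirin inhibits platelet aggregation", "aspirin is an nsaid"]
paths = [
    ["aspirin inhibits platelet aggregation", "aspirin is an nsaid"],
    ["aspirin raises blood pressure sharply"],
]
best = select_path(paths, evidence)
```

Replacing rStar's discriminator with such a scorer shifts path selection from agreement between two models to agreement with retrieved evidence, which is what grounds the final answer factually.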