🤖 AI Summary
To address the high hallucination rates, low factual accuracy, and poor response consistency of small language models (SLMs) on knowledge-intensive tasks, this paper proposes MCTS-RAG, a reasoning framework that deeply integrates Monte Carlo Tree Search (MCTS) with Retrieval-Augmented Generation (RAG). Unlike prior approaches, MCTS-RAG jointly optimizes retrieval context and reasoning paths at every MCTS node. It introduces a confidence-driven iterative re-ranking and node-expansion mechanism, making the inference process adaptive and explicitly fact-aware. Evaluated on the ComplexWebQA, GPQA, and FoolMeTwice benchmarks, MCTS-RAG enables SLMs to match the performance of GPT-4o while substantially reducing hallucinations and significantly improving both factual accuracy and response consistency. This work is the first framework to achieve tight, joint optimization of retrieval and generation within an MCTS-based reasoning paradigm.
📄 Abstract
We introduce MCTS-RAG, a novel approach that enhances the reasoning capabilities of small language models on knowledge-intensive tasks by leveraging retrieval-augmented generation (RAG) to provide relevant context and Monte Carlo Tree Search (MCTS) to refine reasoning paths. MCTS-RAG dynamically integrates retrieval and reasoning through an iterative decision-making process. Unlike standard RAG methods, which typically retrieve information independently of reasoning and thus integrate knowledge suboptimally, or conventional MCTS reasoning, which depends solely on internal model knowledge without external facts, MCTS-RAG combines structured reasoning with adaptive retrieval. This integrated approach enhances decision-making, reduces hallucinations, and improves both factual accuracy and response consistency. Experimental results on multiple reasoning and knowledge-intensive datasets (i.e., ComplexWebQA, GPQA, and FoolMeTwice) show that our method enables small-scale LMs to achieve performance comparable to frontier LLMs like GPT-4o by effectively scaling inference-time compute, setting a new standard for reasoning in small-scale models.
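To make the high-level idea concrete, the search loop described above can be sketched as standard MCTS in which each tree node chooses between a *retrieve* action (pulling external facts into the reasoning state) and an *answer* action, so retrieval decisions are explored and scored alongside reasoning steps. This is a minimal toy sketch, not the paper's implementation: the fact table, `toy_retrieve`, `toy_answer`, and the two-action space are all hypothetical stand-ins for a real retriever and LM-based evaluation.

```python
import math
import random

# Hypothetical stand-in for an external knowledge source / retriever.
FACTS = {"capital of France": "Paris"}

def toy_retrieve(state, query):
    """RETRIEVE action: append a retrieved fact to the reasoning state."""
    return state + [f"fact: {query} -> {FACTS.get(query, '')}"]

def toy_answer(state):
    """Toy terminal reward: 1.0 if the needed fact reached the state."""
    return 1.0 if any("Paris" in step for step in state) else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        # Unvisited children are explored first; otherwise standard UCT.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def expand(node):
    # Two candidate actions per node: retrieve external context, or
    # commit to an answer. Retrieval is thus searched, not bolted on.
    node.children = [
        Node(toy_retrieve(node.state, "capital of France"), parent=node),
        Node(node.state + ["answer"], parent=node),
    ]

def mcts(iterations=50, seed=0):
    random.seed(seed)
    root = Node([])
    for _ in range(iterations):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=Node.uct)
        if node.visits > 0 and node.state[-1:] != ["answer"]:
            expand(node)                          # expansion
            node = random.choice(node.children)
        reward = toy_answer(node.state)           # evaluation / rollout
        while node is not None:                   # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the state of the most-visited first-level action.
    return max(root.children, key=lambda n: n.visits).state

print(mcts())
```

In this toy run the retrieve branch accumulates reward while the answer-without-retrieval branch does not, so the search converges on retrieving first, illustrating (in miniature) how MCTS can decide *when* retrieval pays off.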