🤖 AI Summary
This work addresses a core challenge in multi-hop retrieval-augmented generation (RAG) systems: poor decisions about when to stop retrieval compromise either efficiency or accuracy. The study reproduces and extends MetaRAG, a framework whose metacognitive mechanism lets large language models critique and iteratively refine their own reasoning. The authors compare PointWise and ListWise rerankers and benchmark MetaRAG against SIM-RAG, which uses a lightweight critic model to decide when to stop retrieval. The results confirm MetaRAG's relative improvements over standard RAG and reasoning-based baselines, but absolute scores are lower than originally reported, which the authors attribute to updates to closed-source LLMs, missing implementation details, and unreleased prompts. Reranking yields substantial gains, and MetaRAG is more robust than SIM-RAG when extended with additional retrieval features.
📝 Abstract
Recently, Retrieval Augmented Generation (RAG) has shifted focus to multi-retrieval approaches to tackle complex tasks such as multi-hop question answering. However, these systems struggle to decide when to stop searching once enough information has been gathered. To address this, Zhou et al. (2024) introduced Metacognitive Retrieval Augmented Generation (MetaRAG), a framework inspired by metacognition that enables Large Language Models to critique and refine their reasoning. In this reproducibility paper, we reproduce MetaRAG following its original experimental setup and extend it in two directions: (i) by evaluating the effect of PointWise and ListWise rerankers, and (ii) by comparing it with SIM-RAG, which employs a lightweight critic model to stop retrieval. Our results confirm MetaRAG's relative improvements over standard RAG and reasoning-based baselines, but also reveal lower absolute scores than originally reported, reflecting challenges with closed-source LLM updates, missing implementation details, and unreleased prompts. We find that MetaRAG's results are partially reproduced, that it gains substantially from reranking, and that it is more robust than SIM-RAG when extended with additional retrieval features.
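The stopping problem described above can be illustrated with a minimal Python sketch of a multi-round retrieval loop gated by a critic. This is not the MetaRAG or SIM-RAG procedure, neither of which is specified here; the names `iterative_retrieve`, `toy_retrieve`, `toy_critic`, and `TOY_INDEX`, and the query-reformulation step, are hypothetical stand-ins chosen only to show where a stop decision sits in the loop.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not taken from the paper):
# a retriever maps a query to passages; a critic decides whether the
# evidence gathered so far is enough to answer the question.
Retriever = Callable[[str], List[str]]
Critic = Callable[[str, List[str]], bool]


def iterative_retrieve(
    question: str,
    retrieve: Retriever,
    is_sufficient: Critic,
    max_rounds: int = 3,
) -> List[str]:
    """Retrieve in rounds until the critic says stop or the budget runs out."""
    evidence: List[str] = []
    query = question
    for _ in range(max_rounds):
        # Add newly retrieved passages, skipping duplicates.
        for passage in retrieve(query):
            if passage not in evidence:
                evidence.append(passage)
        # Stop decision: the critic judges whether evidence is sufficient.
        if is_sufficient(question, evidence):
            break
        # Assumed reformulation: extend the query with the latest passage
        # so the next round can reach the next hop.
        query = f"{question} {evidence[-1]}" if evidence else question
    return evidence


# Toy two-hop corpus, keyed by a trigger word (illustration only).
TOY_INDEX = {
    "curie": "Marie Curie was born in Warsaw.",
    "warsaw": "Warsaw is the capital of Poland.",
}


def toy_retrieve(query: str) -> List[str]:
    """Return passages whose trigger word appears in the query."""
    q = query.lower()
    return [passage for key, passage in TOY_INDEX.items() if key in q]


def toy_critic(question: str, evidence: List[str]) -> bool:
    """Pretend the question is answerable once a passage names the country."""
    return any("Poland" in passage for passage in evidence)


if __name__ == "__main__":
    for p in iterative_retrieve(
        "In which country was Marie Curie born?", toy_retrieve, toy_critic
    ):
        print(p)
```

In this toy setting the first round reaches only the birthplace passage, so the critic requests another round; capping `max_rounds` at 1 would stop before the second hop, which is the kind of efficiency versus accuracy trade-off the abstract refers to.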