🤖 AI Summary
To address the limited performance of large language models (LLMs) on multi-hop question answering, a task that demands tight coupling between reasoning and retrieval, this paper proposes Memento, a three-stage prompting strategy: (1) decompose the question into smaller steps, (2) dynamically construct a database of facts using LLMs, grounded in intermediate reasoning outputs, and (3) piece those facts together into the final answer. The core idea is to treat reasoning as “taking notes for one’s future self,” so that facts established at one step remain available to later steps. Memento is layered on top of existing prompting strategies, including chain-of-thought (CoT), CoT with retrieval-augmented generation (CoT-RAG), and ReAct, rather than replacing them. Reported results: on the 9-step PhantomWiki benchmark, Memento doubles CoT performance when all information is provided in context; on open-domain 2WikiMultiHopQA, CoT-RAG with Memento gains 20.1 F1 points over vanilla CoT-RAG and more than 13 F1 points over the multi-hop RAG baseline IRCoT; on MuSiQue, Memento improves ReAct by 3.2 F1 points.
📝 Abstract
Large language models (LLMs) excel at reasoning-only tasks, but struggle when reasoning must be tightly coupled with retrieval, as in multi-hop question answering. To overcome these limitations, we introduce a prompting strategy that first decomposes a complex question into smaller steps, then dynamically constructs a database of facts using LLMs, and finally pieces these facts together to solve the question. We show how this three-stage strategy, which we call Memento, can boost the performance of existing prompting strategies across diverse settings. On the 9-step PhantomWiki benchmark, Memento doubles the performance of chain-of-thought (CoT) when all information is provided in context. On the open-domain version of 2WikiMultiHopQA, CoT-RAG with Memento improves over vanilla CoT-RAG by more than 20 F1 percentage points and over the multi-hop RAG baseline, IRCoT, by more than 13 F1 percentage points. On the challenging MuSiQue dataset, Memento improves ReAct by more than 3 F1 percentage points, demonstrating its utility in agentic settings.
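The three-stage strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `fact_N` key scheme, the use of a plain dictionary as the "database of facts", and all function names are assumptions made for this example. The `llm` argument is a hypothetical callable that maps a prompt string to a completion string, and retrieval (as in the CoT-RAG and ReAct settings) is omitted for brevity.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


def decompose(question: str, llm: LLM) -> List[str]:
    """Stage 1: ask the model to split the question into smaller steps."""
    plan = llm(f"Break this question into numbered steps, one per line:\n{question}")
    # Keep non-empty lines; each line is treated as one step.
    return [line.strip() for line in plan.splitlines() if line.strip()]


def build_fact_db(steps: List[str], llm: LLM) -> Dict[str, str]:
    """Stage 2: resolve steps in order, storing each result as a named fact.

    Every call sees the facts recorded so far, so earlier results act as
    notes for later steps.
    """
    facts: Dict[str, str] = {}
    for i, step in enumerate(steps, start=1):
        known = "\n".join(f"{name}: {value}" for name, value in facts.items())
        prompt = f"Known facts:\n{known}\nResolve this step:\n{step}"
        facts[f"fact_{i}"] = llm(prompt).strip()
    return facts


def synthesize(question: str, facts: Dict[str, str], llm: LLM) -> str:
    """Stage 3: answer the original question using only the collected facts."""
    known = "\n".join(f"{name}: {value}" for name, value in facts.items())
    return llm(f"Question: {question}\nFacts:\n{known}\nAnswer:").strip()


def memento_style_answer(question: str, llm: LLM) -> str:
    """Run the three stages end to end (illustrative name, not from the paper)."""
    steps = decompose(question, llm)
    facts = build_fact_db(steps, llm)
    return synthesize(question, facts, llm)
```

In a retrieval setting, the stage 2 call would presumably be replaced by a CoT-RAG or ReAct-style call that consults documents before recording the fact; the abstract does not specify that interface, so it is left out here.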