🤖 AI Summary
Existing LLM benchmarks rarely feature natural, time-consuming information-seeking questions that require long-horizon reasoning, and so fail to authentically evaluate models' information retrieval and multi-step reasoning capabilities. To address this, we introduce MoNaCo: a large-scale, high-complexity benchmark of 1,315 natural questions, each demanding retrieval across dozens to hundreds of documents and the execution of dozens, and at times over one hundred, reasoning steps. Questions are rigorously validated via decomposition-based human annotation and LLM-assisted verification. We systematically evaluate leading LLMs along three dimensions: F1 score, recall, and hallucination rate. Results reveal that even state-of-the-art models achieve at most 61.2% F1, hampered by severe recall deficiencies and factual hallucinations. These findings highlight long-horizon reasoning and comprehensive information coverage as critical bottlenecks. MoNaCo thus establishes a reliable, challenging evaluation standard and points to concrete improvement directions for next-generation retrieval-augmented reasoning models.
📝 Abstract
Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap, we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve -- far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions -- with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly available at: https://tomerwolgithub.github.io/monaco