🤖 AI Summary
Existing LLM benchmarks rarely feature natural, time-consuming information-seeking questions that require long-horizon reasoning, and so fail to authentically evaluate models' information retrieval and multi-step reasoning capabilities. To address this, we introduce MoNaCo: a large-scale, high-complexity benchmark of 1,315 natural questions, each demanding retrieval across dozens to hundreds of documents and the execution of dozens, and at times over one hundred, reasoning steps. Questions are rigorously validated via decomposition-based human annotation and LLM-assisted verification. We systematically evaluate leading LLMs along three dimensions: F1 score, recall, and hallucination rate. Results reveal that even state-of-the-art models achieve at most 61.2% F1, hampered by severe recall deficiencies and factual hallucinations. These findings highlight long-horizon reasoning and comprehensive information coverage as critical bottlenecks. MoNaCo thus establishes a reliable, challenging evaluation standard and points to concrete improvement directions for next-generation retrieval-augmented reasoning models.
📝 Abstract
Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap, we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve -- far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions -- with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly available at: https://tomerwolgithub.github.io/monaco