Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

📅 2026-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of distinguishing whether multimodal agents engage in strategic reasoning or rely on random trial-and-error when navigating document collections. To this end, we introduce the MADQA benchmark, comprising 2,250 human-authored questions and 800 heterogeneous PDF documents, with high-discrimination tasks designed using Classical Test Theory. We propose a novel evaluation protocol that quantifies the trade-off between answer accuracy and retrieval effort, offering the first metric to assess the strategic nature of an agent’s reasoning. Experimental results reveal that while state-of-the-art agents achieve overall accuracy comparable to humans, their success distributions differ markedly, and they remain nearly 20% below oracle performance, often trapped in inefficient retrieval loops. The dataset and evaluation toolkit are publicly released to advance research in efficient multimodal reasoning.
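The summary describes a protocol that scores agents on the trade-off between answer accuracy and retrieval effort. The paper's exact formula is not given here; the sketch below is a hypothetical formulation in which a correct answer earns credit that decays with the fraction of a retrieval budget consumed. The function names, the 20-step budget, and the linear half-credit decay are all illustrative assumptions, not the authors' metric.

```python
# Hypothetical accuracy-effort trade-off score: full credit for a correct
# answer found with minimal retrieval, decaying linearly as more of the
# step budget is spent. The real MADQA metric may be defined differently.

def effort_adjusted_score(correct: bool, steps_taken: int, step_budget: int) -> float:
    """Return a score in [0, 1] combining correctness and retrieval effort."""
    if not correct:
        return 0.0
    # Fraction of the retrieval budget consumed (clamped to the budget).
    effort = min(steps_taken, step_budget) / step_budget
    # A correct answer keeps at least half credit, even at full budget.
    return 1.0 - 0.5 * effort

def benchmark_score(episodes, step_budget: int = 20) -> float:
    """Average effort-adjusted score over (correct, steps_taken) episode logs."""
    scores = [effort_adjusted_score(c, s, step_budget) for c, s in episodes]
    return sum(scores) / len(scores)
```

Under such a metric, two agents with identical raw accuracy can receive very different scores, which is exactly the separation between strategic navigation and brute-force search that the benchmark targets.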

📝 Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across agents of varying ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
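The abstract says the benchmark is designed under Classical Test Theory to maximize discriminative power. In CTT, a standard discrimination statistic is the point-biserial correlation between an item's 0/1 outcomes and test-takers' total scores; the sketch below illustrates that textbook computation only, and the benchmark's actual item-selection procedure is an assumption here.

```python
# Classical Test Theory item discrimination via point-biserial correlation:
# how strongly a single question's pass/fail outcome tracks overall ability.
import statistics

def point_biserial(item_scores, total_scores):
    """Correlation between binary item scores (0/1) and total test scores.

    High values indicate a high-discrimination item: one that strong
    test-takers tend to answer correctly and weak ones tend to miss.
    Assumes the item is neither passed nor failed by everyone.
    """
    n = len(item_scores)
    mean_total = statistics.fmean(total_scores)
    sd_total = statistics.pstdev(total_scores)
    p = sum(item_scores) / n  # proportion of test-takers answering correctly
    mean_correct = statistics.fmean(
        t for i, t in zip(item_scores, total_scores) if i == 1
    )
    return (mean_correct - mean_total) / sd_total * (p / (1 - p)) ** 0.5
```

Items with low or negative point-biserial values would be candidates for removal when curating a benchmark for discriminative power.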
Problem

Research questions and friction points this paper is trying to address.

strategic reasoning
stochastic search
multimodal agents
document navigation
agent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

strategic reasoning
multimodal agents
document navigation
accuracy-effort trade-off
MADQA benchmark