🤖 AI Summary
While large language models (LLMs) excel at passive reasoning, their capabilities in active reasoning, where a model must proactively interact with an external environment to acquire missing information, lack systematic evaluation. Method: We introduce AR-Bench, the first benchmark explicitly designed for active reasoning, formally defining and quantifying this capability across three realistic interactive domains: commonsense, logical, and symbolic reasoning. The methodology combines multi-turn interactive prompting, task-driven environment simulation, and ablation studies using tree search and post-training strategies. Results: Experiments reveal a substantial gap: state-of-the-art LLMs achieve markedly lower accuracy on active reasoning than on passive reasoning, and existing optimization techniques yield only marginal improvements. AR-Bench is publicly released as the first standardized evaluation platform for active reasoning research.
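
To make the multi-turn protocol concrete, below is a minimal sketch of the kind of interactive evaluation loop the summary describes: the model proactively queries a task-driven simulator until it commits to an answer or exhausts a turn budget. All names here (`ask_model`, `environment_reply`, `is_solved`, `MAX_TURNS`) are illustrative assumptions, not AR-Bench's actual API.

```python
# Hedged sketch of a multi-turn active-reasoning episode.
# The callables and the turn budget are hypothetical placeholders.

MAX_TURNS = 25  # assumed per-episode interaction budget


def run_episode(ask_model, environment_reply, is_solved, initial_prompt):
    """Drive one interactive episode: the model asks questions, a
    simulated environment answers, and the loop ends when the model
    produces a final answer or the turn budget is exhausted."""
    history = [("system", initial_prompt)]
    for _ in range(MAX_TURNS):
        question = ask_model(history)        # model proactively queries
        history.append(("model", question))
        reply = environment_reply(question)  # task-driven simulator feedback
        history.append(("env", reply))
        if is_solved(history):               # model committed to an answer
            return history, True
    return history, False                    # unsolved within the budget
```

The key contrast with passive benchmarks is that the initial prompt is deliberately incomplete; accuracy depends on which questions the model chooses to ask.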
📝 Abstract
While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning, in which an LLM must interact with external systems to acquire missing evidence or data, has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that together simulate real-world, agentic scenarios and measure performance on commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs struggle markedly with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks, revealing a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based search or post-training, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., by incorporating interactive learning, real-time feedback loops, and environment-aware training objectives. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.
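
As a concrete illustration of the symbolic task family, the guessing-numbers game can be modeled with Bulls-and-Cows-style feedback over a hidden digit string. The exact rules below, and the `feedback` helper, are assumptions for illustration, not AR-Bench's specification.

```python
# Assumed feedback rule for a guessing-numbers round: the environment
# reports how many digits are exactly placed and how many are present
# but misplaced. The model must choose informative guesses to narrow
# the search space, which is the essence of active reasoning here.

def feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (exact, misplaced) digit counts for one guess."""
    exact = sum(s == g for s, g in zip(secret, guess))
    common = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return exact, common - exact


# Example: '1' and '2' are exactly placed; '3' and '4' are swapped.
print(feedback("1234", "1243"))  # -> (2, 2)
```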