Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations assess deductive, inductive, or abductive reasoning in isolation, failing to capture large language models’ (LLMs) integrated reasoning capabilities in unfamiliar environments. Method: We propose a novel black-box interaction paradigm wherein models infer hidden functions solely from limited rounds of input-output observations—unifying the assessment of all three reasoning types through collaborative, iterative hypothesis generation, testing, and refinement. To support this, we introduce Oracle, the first benchmark comprising six task categories and 96 diverse black-box functions, enabling end-to-end evaluation of high-level planning and adaptive exploration. Contribution/Results: We evaluate 19 state-of-the-art LLMs; while top models (e.g., o3) achieve >70% accuracy on simple tasks, performance drops below 40% on challenging ones—revealing systemic weaknesses in dynamic hypothesis formation, validation, and exploration strategy. Oracle provides a scalable, reproducible framework for rigorous, holistic reasoning evaluation.

📝 Abstract
Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in interactive, unknown environments. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for human discovery of the real world. We introduce a novel evaluation paradigm, black-box interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it over a given number of exploration turns and reasoning over the observed input-output pairs. Leveraging this idea, we build the Oracle benchmark, which comprises 6 types of black-box tasks and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40%. Further analysis indicates a universal difficulty among LLMs: they lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.
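The interaction protocol described in the abstract can be sketched in a few lines. The hidden function, the query strategy, and the turn budget below are illustrative assumptions, not the Oracle benchmark's actual tasks or API; the real benchmark defines its own black-boxes and evaluation protocol.

```python
# Minimal sketch of the black-box interaction loop: an agent probes a
# hidden function for a fixed number of turns and accumulates
# input-output observations from which it must infer the rule.

def hidden_function(x):
    # One hypothetical black-box; the agent never sees this code.
    return x * x + 1

def run_interaction(black_box, propose_query, max_turns=8):
    """Collect observations over a fixed exploration budget.

    `propose_query` stands in for the LLM: given the observations so
    far, it returns the next input to probe.
    """
    observations = []
    for _ in range(max_turns):
        query = propose_query(observations)
        observations.append((query, black_box(query)))
    return observations

# A trivial stand-in strategy: probe consecutive small integers.
history = run_interaction(hidden_function, lambda obs: len(obs))
print(history[:3])  # -> [(0, 1), (1, 2), (2, 5)]
```

An adaptive agent would instead choose each query to discriminate between its current candidate hypotheses, which is exactly the planning capability the paper reports as a universal weakness.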
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' integrated reasoning in unknown interactive environments
Assessing deductive, inductive and abductive reasoning capabilities holistically
Developing adaptive exploration strategies for black-box function discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box interaction evaluation paradigm
Oracle benchmark with 96 black-boxes
Tests LLM planning and hypothesis refinement
Congchi Yin
Nanjing University of Aeronautics and Astronautics
Tianyi Wu
Peking University
Yankai Shu
Peking University
Alex Gu
MIT
program synthesis, machine learning, large language models, code generation
Yunhan Wang
Peking University
Jun Shao
Professor of Statistics, University of Wisconsin Madison
Statistics
Xun Jiang
Theta Health Inc.
Piji Li
Nanjing University of Aeronautics and Astronautics