V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) excel at single-step, closed-ended visual question answering but show limited capability on open-ended tasks that require multi-step visual exploration and reasoning, for which no quantitative evaluation framework exists. Method: We introduce V-REX, the first benchmark for multi-step exploratory visual reasoning, which formalizes tasks as structured Chain-of-Questions (CoQ) processes. V-REX explicitly decouples "planning" (generating an exploratory question chain) from "following" (answering a curated chain of questions sequentially). It uses constrained intermediate steps with finite options, human-curated complex tasks across multiple domains, stepwise constraints, and consistency verification to enable fine-grained, reproducible, quantitative assessment of visual reasoning paths. Contribution/Results: Comprehensive evaluation reveals consistent scaling trends across VLMs, yet planning ability lags behind following ability, highlighting a critical bottleneck and substantial room for improvement in multi-step exploratory reasoning.

📝 Abstract
While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability to (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-sourced VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-step visual reasoning in vision-language models
Assessing planning and following abilities in exploratory tasks
Quantifying intermediate steps in complex open-ended visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Questions framework for multi-step visual reasoning
Evaluation protocol with finite options for reliable step analysis
Benchmark covering diverse domains to assess planning and following abilities
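The summary and abstract describe scoring planning (choosing the curated question at each step from finite options) separately from following (answering the curated chain). The paper does not give its protocol as code, so the sketch below is purely illustrative: all names, the data layout, and exact-match accuracy as the metric are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a V-REX-style CoQ scoring protocol.
# Data layout, function names, and exact-match scoring are illustrative
# assumptions; the benchmark's actual format is not specified here.
from dataclasses import dataclass

@dataclass
class Step:
    question_options: list[str]  # finite candidate questions at this step
    gold_question: str           # curated question in the reference chain
    answer_options: list[str]    # finite candidate answers
    gold_answer: str             # curated answer

def planning_accuracy(steps: list[Step], chosen_questions: list[str]) -> float:
    """Fraction of steps where the model selected the curated exploratory question."""
    hits = sum(q == s.gold_question for s, q in zip(steps, chosen_questions))
    return hits / len(steps)

def following_accuracy(steps: list[Step], given_answers: list[str]) -> float:
    """Fraction of curated questions answered correctly when the chain is provided."""
    hits = sum(a == s.gold_answer for s, a in zip(steps, given_answers))
    return hits / len(steps)

# Toy two-step chain: the model plans the first question correctly but not
# the second, and answers both curated questions correctly.
chain = [
    Step(["Where is the clock?", "What color is the car?"],
         "Where is the clock?", ["on the wall", "on the desk"], "on the wall"),
    Step(["What time does it show?", "Is it analog?"],
         "What time does it show?", ["3:15", "9:40"], "9:40"),
]

print(planning_accuracy(chain, ["Where is the clock?", "Is it analog?"]))  # 0.5
print(following_accuracy(chain, ["on the wall", "9:40"]))                  # 1.0
```

Constraining each step to finite options is what makes both scores exact-match computable, avoiding free-text grading of intermediate reasoning.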
Authors
Chenrui Fan (University of Maryland, College Park)
Yijun Liang (yliang17@umd.edu)
Shweta Bhardwaj (University of Maryland, College Park)
Kwesi Cobbina (University of Maryland, College Park)
Ming Li (University of Maryland, College Park)
Tianyi Zhou (University of Maryland, College Park)