VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Large Vision-Language Models (LVLMs) show limited capability on multi-image, multi-turn visual reasoning tasks, relying heavily on linguistic priors while neglecting progressive, context-sensitive, vision-to-vision reasoning. To address this, the authors introduce VisChainBench, the first large-scale benchmark dedicated to multi-step visual chain reasoning, comprising 1,457 tasks and over 20,000 images spanning everyday and engineering decision-making scenarios. The benchmark is constructed with a novel multi-agent generative pipeline that enables cross-image contextual modeling and low-language-guidance task design, mitigating language bias while ensuring high visual diversity. A comprehensive evaluation of state-of-the-art LVLMs reveals, for the first time, their critical reasoning bottlenecks under high visual dependency. VisChainBench thus establishes a new standard and methodology for advancing truly vision-centric, multi-turn visual understanding.

📝 Abstract
Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons -- e.g., spotting visual differences or assessing appropriateness -- while relying heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All benchmark data and construction code are available at the following link: https://huggingface.co/datasets/eyehole/VisChainBench
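Since the benchmark is hosted on the Hugging Face Hub, a minimal loading sketch looks as follows; the split name and example schema are assumptions, as the page does not spell them out:

```python
# Minimal loading sketch, assuming a default config on the Hub.
# The split name ("test") and the field layout are assumptions;
# check the dataset card for the actual schema.
from datasets import load_dataset

ds = load_dataset("eyehole/VisChainBench", split="test")

example = ds[0]
print(example.keys())  # inspect task fields before building an evaluation harness
```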
Problem

Research questions and friction points this paper is trying to address.

Evaluates multi-step visual reasoning in LVLMs
Assesses context-dependent reasoning across sequential tasks
Reduces reliance on language cues in visual inference (see the evaluation sketch after this list)
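The points above center on sequential, context-dependent evaluation. A minimal sketch of how such a multi-turn loop might be driven is below; the task fields (turns, images, question, answer) and the model interface are illustrative assumptions, not the benchmark's actual API:

```python
# Illustrative multi-turn evaluation loop. Field names and the
# model.generate interface are assumptions; the real schema may differ.
def evaluate_task(model, task):
    history = []  # grows across turns, so each answer conditions the next
    n_correct = 0
    for turn in task["turns"]:
        history.append({"images": turn["images"], "question": turn["question"]})
        prediction = model.generate(history)  # model sees the full visual context so far
        history.append({"answer": prediction})
        n_correct += int(prediction == turn["answer"])
    return n_correct / len(task["turns"])
```

Because the history accumulates images from earlier turns, a model that ignores prior visual context degrades as the chain lengthens, which is the kind of failure mode the benchmark is designed to expose.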
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent pipeline generates diverse visual benchmark (a hypothetical sketch follows this list)
Sequential interdependent tasks test progressive visual reasoning
Minimizes language cues to focus on visual inference
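As noted in the first point, a hypothetical sketch of such a multi-agent generation pipeline follows; every agent role and function name here is invented for illustration, since the paper's actual pipeline is not detailed on this page:

```python
# Hypothetical multi-agent generation sketch. A designer agent drafts a
# chain of interdependent steps, an image agent realizes each step
# visually, and a critic agent filters out tasks solvable from text
# alone (the language-bias control). All names are illustrative.
def generate_task(designer, imager, critic, max_rounds=3):
    for _ in range(max_rounds):
        chain = designer.draft_chain()             # sequence of dependent visual steps
        images = [imager.realize(step) for step in chain]
        if critic.solvable_without_images(chain):  # too much language leakage: reject
            continue
        return {"steps": chain, "images": images}
    return None  # no acceptable task within the round budget
```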
👥 Authors
Wenbo Lyu, University of Chinese Academy of Sciences
Yingjun Du, University of Amsterdam (Meta-learning, Vision-language model)
Jinglin Zhao, Huazhong University of Science and Technology
Xiantong Zhen, United Imaging Healthcare
Ling Shao, University of Chinese Academy of Sciences