VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Large Vision-Language Models (LVLMs) show limited capability on multi-image, multi-turn visual reasoning tasks, relying heavily on linguistic priors while neglecting progressive, context-sensitive, vision-to-vision reasoning. To address this, the authors introduce VisChainBench, the first large-scale benchmark dedicated to multi-step visual chain reasoning, comprising 1,457 tasks and over 20,000 images spanning everyday and engineering decision-making scenarios. The benchmark is constructed with a novel multi-agent generative pipeline that enables cross-image contextual modeling and low-language-guidance task design, mitigating language bias while ensuring high visual diversity. A comprehensive evaluation of state-of-the-art LVLMs reveals, for the first time, their critical reasoning bottlenecks under high visual dependency. VisChainBench thus establishes a new standard and methodology for advancing truly vision-centric, multi-turn visual understanding.

📝 Abstract
Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons -- e.g., spotting visual differences or assessing appropriateness -- while relying heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All benchmark data and construction code are available at the following link: https://huggingface.co/datasets/eyehole/VisChainBench
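Since the benchmark is hosted on the Hugging Face Hub, a minimal loading sketch looks as follows; the split name and example schema are assumptions, as the page does not spell them out:

```python
# Minimal loading sketch, assuming a default config on the Hub.
# The split name ("test") and the field layout are assumptions;
# check the dataset card for the actual schema.
from datasets import load_dataset

ds = load_dataset("eyehole/VisChainBench", split="test")

example = ds[0]
print(example.keys())  # inspect task fields before building an evaluation harness
```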
Problem

Research questions and friction points this paper is trying to address.

Evaluates multi-step visual reasoning in LVLMs
Assesses context-dependent reasoning across sequential tasks
Reduces reliance on language cues in visual inference (see the evaluation sketch after this list)
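The points above center on sequential, context-dependent evaluation. A minimal sketch of how such a multi-turn loop might be driven is below; the task fields (turns, images, question, answer) and the model interface are illustrative assumptions, not the benchmark's actual API:

```python
# Illustrative multi-turn evaluation loop. Field names and the
# model.generate interface are assumptions; the real schema may differ.
def evaluate_task(model, task):
    history = []  # grows across turns, so each answer conditions the next
    n_correct = 0
    for turn in task["turns"]:
        history.append({"images": turn["images"], "question": turn["question"]})
        prediction = model.generate(history)  # model sees the full visual context so far
        history.append({"answer": prediction})
        n_correct += int(prediction == turn["answer"])
    return n_correct / len(task["turns"])
```

Because the history accumulates images from earlier turns, a model that ignores prior visual context degrades as the chain lengthens, which is the kind of failure mode the benchmark is designed to expose.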
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent pipeline generates diverse visual benchmark (a hypothetical sketch follows this list)
Sequential interdependent tasks test progressive visual reasoning
Minimizes language cues to focus on visual inference
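As noted in the first point, a hypothetical sketch of such a multi-agent generation pipeline follows; every agent role and function name here is invented for illustration, since the paper's actual pipeline is not detailed on this page:

```python
# Hypothetical multi-agent generation sketch. A designer agent drafts a
# chain of interdependent steps, an image agent realizes each step
# visually, and a critic agent filters out tasks solvable from text
# alone (the language-bias control). All names are illustrative.
def generate_task(designer, imager, critic, max_rounds=3):
    for _ in range(max_rounds):
        chain = designer.draft_chain()             # sequence of dependent visual steps
        images = [imager.realize(step) for step in chain]
        if critic.solvable_without_images(chain):  # too much language leakage: reject
            continue
        return {"steps": chain, "images": images}
    return None  # no acceptable task within the round budget
```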
👥 Authors
Wenbo Lyu, University of Chinese Academy of Sciences
Yingjun Du, University of Amsterdam (Meta-learning, Vision-language model)
Jinglin Zhao, Huazhong University of Science and Technology
Xiantong Zhen, United Imaging Healthcare
Ling Shao, University of Chinese Academy of Sciences