ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

📅 2025-05-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing VI-CoT evaluation benchmarks rely on fixed intermediate visual states (IVS), distorting authentic reasoning trajectories and lacking systematic analysis of IVS influence mechanisms. To address this, we propose ViC-Bench, the first multi-task benchmark enabling free-form IVS generation and evaluation across maze navigation, jigsaw solving, embodied long-horizon planning, and complex counting. ViC-Bench introduces a free-form IVS generation pipeline, function-call-driven IVS modeling, a three-stage progressive evaluation framework, and an incremental prompt information injection (IPII) strategy to decouple the impact of individual prompt components on VI-CoT performance. We conduct large-scale evaluation across 18 state-of-the-art multimodal large language models (MLLMs), uncovering critical bottlenecks in their VI-CoT capabilities. ViC-Bench is publicly released on Hugging Face, establishing a new paradigm for interpretable and evolvable visual reasoning assessment.

๐Ÿ“ Abstract
Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much like a human would. This paradigm has demonstrated impressive success across various tasks, spurring advancements in related benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly distort the original thinking trajectories and thus fail to evaluate intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the factors through which IVS affects untamed reasoning performance. To tackle the above gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. Besides, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors for VI-CoT. We extensively evaluate 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our proposed benchmark is publicly available on Hugging Face.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' reasoning with free-style visual states
Assessing impact of intermediate states on reasoning performance
Benchmarking visual-interleaved chain-of-thought in diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Free-style IVS generation pipeline for diverse tasks
Progressive three-stage evaluation suite with new metrics
Incremental Prompting Information Injection (IPII) strategy
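The IPII idea listed above can be pictured as an ablation loop that adds one prompt component at a time and records performance after each increment, isolating each component's contribution. The sketch below is a hypothetical illustration, not the paper's implementation: the component names, the `run_model` scorer, and its toy length-based scoring are all invented for demonstration.

```python
# Hypothetical sketch of an IPII-style ablation loop: prompt components are
# injected incrementally so each increment's contribution can be measured.
# `run_model` is a stand-in for querying an MLLM and scoring its answer.

def run_model(prompt: str) -> float:
    """Mock scorer (assumption): returns a score in [0, 1] for a prompt.
    A real setup would query an MLLM and grade its output against ground truth."""
    return min(1.0, len(prompt) / 100)  # toy proxy, for illustration only

def ipii_ablation(components: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Inject prompt components one by one, scoring after each injection."""
    prompt, results = "", []
    for name, text in components:
        prompt += text                       # incremental injection
        results.append((name, run_model(prompt)))
    return results

# Illustrative prompt components for a maze-navigation task (assumed names).
components = [
    ("task_instruction", "Solve the maze step by step. "),
    ("ivs_description",  "An updated maze image is provided after each move. "),
    ("output_format",    "Answer with a sequence of moves (U/D/L/R). "),
]
```

Comparing consecutive scores in the returned list then attributes performance changes to the component injected at that step.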
Xuecheng Wu
Xi'an Jiaotong University
Jiaxing Liu
Meituan Inc
Danlei Huang
Xi'an Jiaotong University
Xiaoyu Li
Meituan Inc
Yifan Wang
University of Science and Technology of China
Chen Chen
Meituan Inc
Liya Ma
University of Malaya
RF-MEMS · Printable electronics · Microelectronics
Xuezhi Cao
Meituan
Data Mining · Knowledge Graph · LLMs
Junxiao Xue
Zhejiang Lab
Computer Graphics · Crowd simulation · Multi-agents Modeling · Multi-modal Learning