Empowering Reliable Visual-Centric Instruction Following in MLLMs

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing multimodal instruction-following evaluation benchmarks focus primarily on textual instructions and overlook implicit constraints embedded in the visual modality, so they fail to comprehensively assess models' alignment capabilities under joint vision-language instructions. To address this gap, this work proposes VC-IFEval, the first vision-centric instruction-following evaluation framework, which systematically constructs an instruction dataset incorporating vision-dependent constraints to enable fine-grained assessment of multimodal large language models (MLLMs). Experiments demonstrate that the benchmark effectively uncovers deficiencies of current models in handling vision-related instructions, and that targeted fine-tuning significantly improves both instruction-following accuracy and output consistency. The benchmark thus fills a critical void in evaluating visual-constraint adherence within multimodal instruction understanding.

📝 Abstract
Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability focus primarily on verbal instructions in the textual modality. This limitation hinders a thorough analysis of instruction-following capabilities, as it overlooks the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability in multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.
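
The paper's key idea is a constraint whose satisfaction can only be verified against the image, not the instruction text alone. The sketch below is a minimal illustration of that idea, not the paper's actual protocol: the class names, annotation format, and checker function are all assumptions made up for this example.

```python
# Hypothetical sketch of a vision-dependent constraint check.
# VC-IFEval's real evaluation pipeline is not reproduced here;
# names and the annotation schema are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VisionConstraint:
    """A constraint whose ground truth depends on the image content."""
    description: str  # e.g. "List every red object visible in the image."
    # (model_output, image_annotations) -> pass/fail
    check: Callable[[str, dict], bool]

def mentions_all_red_objects(output: str, annotations: dict) -> bool:
    """Pass iff every annotated red object is named in the model output."""
    red_objects = [o["name"] for o in annotations["objects"]
                   if o.get("color") == "red"]
    return all(name.lower() in output.lower() for name in red_objects)

constraint = VisionConstraint(
    description="List every red object visible in the image.",
    check=mentions_all_red_objects,
)

# Annotations would come from the benchmark's ground-truth labels;
# here we fake a tiny example to show the pass/fail decision.
annotations = {"objects": [{"name": "apple", "color": "red"},
                           {"name": "cup", "color": "blue"}]}
print(constraint.check("I can see a red apple on the table.", annotations))  # True
```

The point of the sketch is that a purely textual judge cannot score this instruction: whether "apple" must appear in the output is determined by the image annotations, which is what distinguishes vision-dependent constraints from the verbal constraints in prior IF benchmarks.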
Problem

Research questions and friction points this paper is trying to address.

instruction following
multimodal large language models
visual constraints
benchmark evaluation
visual-centric instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-centric instruction following
multimodal large language models
instruction-following evaluation
vision-dependent constraints
VC-IFEval