Empowering Reliable Visual-Centric Instruction Following in MLLMs

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing multimodal instruction-following evaluation benchmarks focus primarily on textual instructions and overlook implicit constraints embedded in the visual modality, so they fail to comprehensively assess models' alignment capabilities under joint vision-language instructions. To address this gap, this work proposes VC-IFEval, the first vision-centric instruction-following evaluation framework, which systematically constructs an instruction dataset incorporating vision-dependent constraints to enable fine-grained assessment of multimodal large language models (MLLMs). Experiments demonstrate that the benchmark effectively uncovers deficiencies of current models in handling vision-related instructions, and that targeted fine-tuning significantly improves both instruction-following accuracy and output consistency. The benchmark thus fills a critical void in evaluating visual-constraint adherence within multimodal instruction understanding.

📝 Abstract
Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability focus primarily on verbal instructions in the textual modality. This limitation hinders a thorough analysis of instruction-following capabilities, as it overlooks the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability in multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.
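
The paper's key idea is a constraint whose satisfaction can only be verified against the image, not the instruction text alone. The sketch below is a minimal illustration of that idea, not the paper's actual protocol: the class names, annotation format, and checker function are all assumptions made up for this example.

```python
# Hypothetical sketch of a vision-dependent constraint check.
# VC-IFEval's real evaluation pipeline is not reproduced here;
# names and the annotation schema are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VisionConstraint:
    """A constraint whose ground truth depends on the image content."""
    description: str  # e.g. "List every red object visible in the image."
    # (model_output, image_annotations) -> pass/fail
    check: Callable[[str, dict], bool]

def mentions_all_red_objects(output: str, annotations: dict) -> bool:
    """Pass iff every annotated red object is named in the model output."""
    red_objects = [o["name"] for o in annotations["objects"]
                   if o.get("color") == "red"]
    return all(name.lower() in output.lower() for name in red_objects)

constraint = VisionConstraint(
    description="List every red object visible in the image.",
    check=mentions_all_red_objects,
)

# Annotations would come from the benchmark's ground-truth labels;
# here we fake a tiny example to show the pass/fail decision.
annotations = {"objects": [{"name": "apple", "color": "red"},
                           {"name": "cup", "color": "blue"}]}
print(constraint.check("I can see a red apple on the table.", annotations))  # True
```

The point of the sketch is that a purely textual judge cannot score this instruction: whether "apple" must appear in the output is determined by the image annotations, which is what distinguishes vision-dependent constraints from the verbal constraints in prior IF benchmarks.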
Problem

Research questions and friction points this paper is trying to address.

instruction following
multimodal large language models
visual constraints
benchmark evaluation
visual-centric instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-centric instruction following
multimodal large language models
instruction-following evaluation
vision-dependent constraints
VC-IFEval