🤖 AI Summary
Existing multimodal instruction-following benchmarks primarily focus on textual instructions and overlook implicit constraints embedded in the visual modality, so they cannot comprehensively assess models' alignment under joint vision-language instructions. To address this gap, this work proposes VC-IFEval, the first vision-centric instruction-following evaluation framework, together with a systematically constructed instruction dataset that incorporates vision-dependent constraints to enable fine-grained assessment of multimodal large language models (MLLMs). Experiments show that the benchmark effectively uncovers deficiencies of current models in handling vision-related instructions and that targeted fine-tuning significantly improves both instruction-following accuracy and output consistency, filling a critical gap in evaluating visual constraint adherence within multimodal instruction understanding.
📝 Abstract
Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability primarily focus on verbal instructions in the textual modality. This limitation hinders a thorough analysis of instruction-following capabilities, as it overlooks the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability under multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.
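To make the notion of a vision-dependent constraint concrete, here is a minimal, purely hypothetical sketch (not taken from the paper) of what such a benchmark item and a rule-based adherence check might look like: the instruction's constraint (one sentence per person visible in the image) can only be verified against information extracted from the image, here represented by an assumed `gold_count` annotation. The class, field, and function names are illustrative assumptions, not the paper's actual data format or evaluation code.

```python
import re
from dataclasses import dataclass

@dataclass
class VisionConstraintItem:
    """Hypothetical benchmark item whose constraint depends on the image."""
    image_path: str    # image the constraint refers to
    instruction: str   # textual instruction shown to the MLLM
    gold_count: int    # ground-truth value derived from the image (e.g., number of people)

def follows_constraint(item: VisionConstraintItem, response: str) -> bool:
    """Check that the response contains exactly one sentence per person in the image."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) == item.gold_count

item = VisionConstraintItem(
    image_path="kitchen.jpg",
    instruction="Describe the scene using exactly one sentence per person visible in the image.",
    gold_count=3,
)
print(follows_constraint(item, "A man cooks. A woman reads. A child plays."))  # True
```

Unlike text-only constraints (e.g., "answer in three sentences"), this check cannot be performed from the instruction alone, which is what makes such items vision-centric.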