AI Summary
This work addresses the challenge of reliably evaluating logical consistency in multimodal large language models (MLLMs) without relying on ground-truth annotations. Existing methods depend on such annotations and are prone to interference from random guessing. We propose VL-LCM, the first annotation-free framework for assessing visual-linguistic logical consistency, which leverages necessary and sufficient causal relationships through vision-language alignment modeling and an unsupervised consistency scoring mechanism. Systematic evaluation of 11 prominent open-source MLLMs on benchmarks including MMMU, MC-VQA, and NaturalBench reveals that, despite high accuracy, current models exhibit substantial deficits in logical consistency. VL-LCM correlates strongly with supervised metrics and functions as a reliability indicator independent of accuracy, supporting model selection and answer trustworthiness assessment.
Abstract
Dominant accuracy-based evaluation may reward unwarranted guessing by large language models, and it may not be applicable to model validation on novel tasks that lack ground-truth (gt) annotation. Based on basic principles of logic, we propose a novel framework to evaluate the vision-language logical consistency of multimodal large language models (MLLMs) on both sufficient and necessary cause-effect relations. We define the Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests and on recent NaturalBench tests, without the need for gt annotation. Through systematic experiments on the representative VL benchmark MMMU and on recent VL challenges such as NaturalBench, we evaluate 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite the significant progress of recent MLLMs in accuracy, logical consistency lags significantly behind. Extensive evaluations of the correlation of VL-LCM with gt-based metrics, the reliability of VL-LCM, and the relation of VL-LCM to the response distribution support the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that logical consistency could be used, beyond accuracy alone, as an indicator of both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.