Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric

📅 2026-05-07

🤖 AI Summary
This work addresses the challenge of reliably evaluating logical consistency in multimodal large language models (MLLMs) without relying on ground-truth annotations. Existing methods depend on such annotations and are prone to interference from random guessing. We propose VL-LCM, the first annotation-free framework for assessing visual-linguistic logical consistency, which leverages necessary and sufficient causal relationships through vision-language alignment modeling and an unsupervised consistency scoring mechanism. Systematic evaluation across 11 prominent open-source MLLMs on benchmarks including MMMU, MC-VQA, and NaturalBench reveals that despite high accuracy, current models exhibit substantial deficits in logical consistency. VL-LCM demonstrates strong correlation with supervised metrics and functions as a reliability indicator independent of accuracy, effectively supporting model selection and answer trustworthiness assessment.
📝 Abstract
The dominant accuracy-based evaluation might reward unwarranted guessing by large language models, and it might not be applicable to model validation on novel tasks without ground-truth (gt) annotation. Based on basic logical principles, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define the Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests and recent NaturalBench tests, without the need for gt annotation. Through systematic experiments on the representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite the significant progress of recent MLLMs on accuracy, logical consistency lags behind significantly. Extensive evaluations of the correlations of VL-LCM with metrics on gt, the reliability of LCM, and the relation of VL-LCM with response distribution justify the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that, beyond accuracy, logical consistency could be employed to assess both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.
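To make the idea of annotation-free consistency checking concrete, here is a minimal Python sketch. It is not the paper's VL-LCM definition, which the abstract does not spell out. It assumes a hypothetical setup in which a model answers a multiple-choice question and is then asked a yes/no follow-up for each option ("Is the answer <option>?"). The `Probe` structure, the function names, and the two checks are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One multiple-choice question plus yes/no follow-ups (hypothetical format)."""
    chosen: str                # option the model picked, e.g. "B"
    confirms: dict[str, bool]  # option -> True if the model said "yes" to "Is the answer <option>?"


def is_consistent(p: Probe) -> bool:
    # Assumed "sufficient" check: picking an option implies confirming it.
    sufficient = p.confirms.get(p.chosen, False)
    # Assumed "necessary" check: no option other than the chosen one is confirmed.
    necessary = all(not yes for opt, yes in p.confirms.items() if opt != p.chosen)
    return sufficient and necessary


def consistency_rate(probes: list[Probe]) -> float:
    """Fraction of probes whose answers agree with each other; no ground truth is used."""
    if not probes:
        return 0.0
    return sum(is_consistent(p) for p in probes) / len(probes)


probes = [
    Probe("A", {"A": True, "B": False}),   # consistent
    Probe("B", {"A": True, "B": True}),    # also confirms a rejected option
    Probe("C", {"A": False, "C": False}),  # denies its own choice
]
print(consistency_rate(probes))
```

Note that the score depends only on the model's own responses, so under these assumptions it can be computed on tasks where no gt annotation exists.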
Problem

Research questions and friction points this paper is trying to address.

MLLMs
annotation-free
logical consistency
vision-language
model validation
Innovation

Methods, ideas, or system contributions that make the work stand out.

logical consistency
annotation-free evaluation
vision-language models
causal reasoning
VL-LCM
🔎 Similar Papers
2024-08-29 · arXiv.org · Citations: 7
Ying Gu
German Research Center for Artificial Intelligence
Anomaly Detection, Data Mining, Big Data, Artificial Intelligence
Mei Chee Leong
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
Hui Li Tan
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
Shangbo Mao
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore
Liyuan Li
Senior Scientist of Institute for Infocomm Research, Singapore
computer vision, machine learning, pattern recognition, artificial intelligence, cognitive science
Nancy Chen
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore