🤖 AI Summary
Vision-language models (VLMs) often produce inconsistent predictions for semantically equivalent inputs, undermining their robustness and reliability. To address this, we propose a **training-free, model-agnostic test-time consistency framework** that generates semantics-preserving augmented views from a single test sample and jointly optimizes output-distribution agreement via a **Cross-Entropy Agreement Loss** and a **Pseudo-Label Consistency Loss**. The method operates entirely post hoc, requiring no architectural modifications or supervised parameter updates, and, to our knowledge, is the first to improve multimodal consistency purely at inference time. Evaluated on the MM-R3 benchmark, it substantially improves consistency across diverse state-of-the-art VLMs, including CLIP, BLIP-2, and Qwen-VL, demonstrating a lightweight, general-purpose, plug-and-play route to inference-time robustness.
📝 Abstract
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks, yet they often behave inconsistently when faced with semantically equivalent inputs, undermining their reliability and robustness. Recent benchmarks, such as MM-R3, show that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs despite maintaining high average accuracy. Prior work addresses this issue by modifying model architectures or conducting large-scale fine-tuning on curated datasets. In contrast, we propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training. Our method is entirely post hoc, model-agnostic, and applicable to any VLM whose weights are accessible. Given a single test point, we enforce consistent predictions via two complementary objectives: (i) a Cross-Entropy Agreement Loss that aligns predictive distributions across semantically equivalent inputs, and (ii) a Pseudo-Label Consistency Loss that draws each output toward a self-averaged consensus. The framework is plug-and-play, relying only on information extracted from the single test input itself. Experiments on the MM-R3 benchmark show that our framework yields substantial gains in consistency across state-of-the-art models, establishing a new direction for inference-time adaptation in multimodal learning.
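The two objectives can be illustrated with a minimal numerical sketch. The exact formulation used in the paper is not given here, so the following is an assumption: the agreement loss is taken as the average pairwise cross-entropy between the predictive distributions of the augmented views, and the consistency loss as the cross-entropy between each view's distribution and their simple average (the "self-averaged consensus"). All names and values below are illustrative.

```python
import numpy as np

def softmax(z):
    """Stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_agreement(probs):
    """Average pairwise cross-entropy H(p_i, p_j) over distinct view pairs.

    Low when all views place probability mass on the same answers.
    """
    k = len(probs)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(k):
            if i != j:
                total += -np.sum(probs[i] * np.log(probs[j] + 1e-12))
                pairs += 1
    return total / pairs

def pseudo_label_consistency(probs):
    """Cross-entropy of each view against the self-averaged consensus."""
    consensus = probs.mean(axis=0)  # average distribution over views
    return float(np.mean(
        [-np.sum(consensus * np.log(p + 1e-12)) for p in probs]
    ))

# Hypothetical logits for 3 augmented views of one test input, 3 answer classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [1.8, 0.7, -0.9],
                   [2.2, 0.3, -1.2]])
probs = softmax(logits)

total_loss = cross_entropy_agreement(probs) + pseudo_label_consistency(probs)
```

In a full test-time adaptation loop, `total_loss` would be backpropagated through the model (e.g., in PyTorch) to nudge its predictions on the augmented views toward agreement before emitting the final answer; the NumPy version above only shows the loss computation itself.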