🤖 AI Summary
This work addresses the susceptibility of large vision-language models to hallucinations under degraded visual inputs at test time, which compromises their practical reliability. To mitigate this issue, the authors propose ClipTTT, a novel approach that leverages CLIP's image-text alignment capability within a test-time training framework. ClipTTT introduces a single-sample self-supervised objective to enable immediate model adaptation without altering the original architecture. By dynamically adjusting to input degradations, the method counteracts the distribution shifts induced by common image corruptions. Evaluated across 15 prevalent degradation types, ClipTTT significantly reduces hallucination rates and improves the faithfulness of generated descriptions, demonstrating robustness and practical utility in real-world scenarios.
📝 Abstract
Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
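The core idea, as described above, is to use CLIP's image-text similarity as a self-supervision signal and adapt on a single corrupted test sample. The sketch below is purely illustrative and is not the authors' actual objective: it stands in for the CLIP encoders with plain feature vectors and learns only an additive correction to the image feature by gradient ascent on cosine similarity, whereas ClipTTT adapts components of the LVLM itself using real CLIP guidance. The function names `cosine` and `ttt_adapt` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ttt_adapt(img_feat, txt_feat, steps=50, lr=0.1):
    """Single-sample test-time adaptation (illustrative stand-in for ClipTTT).

    Learns an additive correction `b` that realigns a (corrupted) image
    feature with a reference text feature by gradient ascent on their
    cosine similarity -- the kind of image-text alignment signal the
    paper draws from a pre-trained CLIP model.
    """
    b = np.zeros_like(img_feat)
    for _ in range(steps):
        x = img_feat + b
        nx, nt = np.linalg.norm(x), np.linalg.norm(txt_feat)
        cos = (x @ txt_feat) / (nx * nt)
        # Analytic gradient of cosine similarity w.r.t. the correction b:
        # d/dx [x.t / (|x||t|)] = t/(|x||t|) - cos * x/|x|^2
        grad = txt_feat / (nx * nt) - cos * x / (nx ** 2)
        b += lr * grad  # ascent step: increase alignment
    return b
```

Because adaptation happens per sample and touches only a small correction term, the base model's weights and architecture are left unchanged, mirroring the paper's "without altering the base LVLMs" constraint.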