🤖 AI Summary
This work addresses the susceptibility of large vision-language models to hallucinations under degraded visual inputs at test time, which compromises their practical reliability. To mitigate this issue, the authors propose ClipTTT, a novel approach that leverages CLIP's image-text alignment capability within a test-time training framework. ClipTTT introduces a single-sample self-supervised objective to enable immediate model adaptation without altering the original architecture. By dynamically adjusting to input degradations, the method counteracts the distribution shifts induced by common image corruptions. Evaluated across 15 prevalent degradation types, ClipTTT significantly reduces hallucination rates and improves the faithfulness of generated descriptions, demonstrating robustness and practical utility in real-world scenarios.
📝 Abstract
Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
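The core idea, as described above, is to use CLIP's image-text similarity as a self-supervision signal and adapt on a single corrupted test sample. The sketch below is purely illustrative and is not the authors' actual objective: it stands in for the CLIP encoders with plain feature vectors and learns only an additive correction to the image feature by gradient ascent on cosine similarity, whereas ClipTTT adapts components of the LVLM itself using real CLIP guidance. The function names `cosine` and `ttt_adapt` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ttt_adapt(img_feat, txt_feat, steps=50, lr=0.1):
    """Single-sample test-time adaptation (illustrative stand-in for ClipTTT).

    Learns an additive correction `b` that realigns a (corrupted) image
    feature with a reference text feature by gradient ascent on their
    cosine similarity -- the kind of image-text alignment signal the
    paper draws from a pre-trained CLIP model.
    """
    b = np.zeros_like(img_feat)
    for _ in range(steps):
        x = img_feat + b
        nx, nt = np.linalg.norm(x), np.linalg.norm(txt_feat)
        cos = (x @ txt_feat) / (nx * nt)
        # Analytic gradient of cosine similarity w.r.t. the correction b:
        # d/dx [x.t / (|x||t|)] = t/(|x||t|) - cos * x/|x|^2
        grad = txt_feat / (nx * nt) - cos * x / (nx ** 2)
        b += lr * grad  # ascent step: increase alignment
    return b
```

Because adaptation happens per sample and touches only a small correction term, the base model's weights and architecture are left unchanged, mirroring the paper's "without altering the base LVLMs" constraint.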