ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better

📅 2026-03-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the susceptibility of large vision-language models (LVLMs) to hallucination under degraded visual inputs at test time, which compromises their practical reliability. To mitigate this, the authors propose ClipTTT, an approach that leverages CLIP's image-text alignment capability within a test-time training framework. ClipTTT introduces a single-sample self-supervised objective that enables immediate model adaptation without altering the original architecture. By adapting to input degradations on the fly, the method counteracts the distribution shifts induced by common image corruptions. Evaluated across 15 prevalent degradation types, ClipTTT significantly reduces hallucination rates and improves the faithfulness of generated descriptions, demonstrating robustness and practical utility in real-world scenarios.
πŸ“ Abstract
Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
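The abstract describes a single-sample test-time loop: use CLIP image-text alignment to pick a reliable self-supervision target, then update a small set of parameters while the base LVLM stays frozen. A minimal sketch of that loop is below. It works on pre-computed embeddings; the additive-vector adapter, the argmax target-selection rule, and the finite-difference optimizer are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_guided_ttt_step(img_emb, txt_embs, adapter, lr=0.1, eps=1e-4):
    """One single-sample test-time update (hypothetical sketch).

    img_emb:  CLIP embedding of the (possibly corrupted) test image
    txt_embs: CLIP embeddings of candidate captions; the best-aligned one
              is treated as the reliable self-supervision target
    adapter:  small learnable vector added to the image embedding; only
              this is updated, mimicking "no change to the base model"
    """
    # Select the self-supervision target: caption most aligned with the image.
    sims = [cosine(img_emb + adapter, t) for t in txt_embs]
    target = txt_embs[int(np.argmax(sims))]

    # Self-supervised loss: negative alignment with the chosen target.
    def loss(a):
        return -cosine(img_emb + a, target)

    # Finite-difference gradient descent on the adapter only
    # (an autograd framework would be used in practice).
    grad = np.zeros_like(adapter)
    for i in range(len(adapter)):
        d = np.zeros_like(adapter)
        d[i] = eps
        grad[i] = (loss(adapter + d) - loss(adapter - d)) / (2 * eps)
    return adapter - lr * grad

# Toy usage: a few update steps should increase image-text alignment.
rng = np.random.default_rng(0)
img = rng.normal(size=8)
captions = [rng.normal(size=8) for _ in range(3)]
adapter = np.zeros(8)
before = max(cosine(img + adapter, t) for t in captions)
for _ in range(20):
    adapter = clip_guided_ttt_step(img, captions, adapter)
after = max(cosine(img + adapter, t) for t in captions)
```

The key design point the sketch mirrors is that the supervision signal comes from a frozen, alignment-strong model (CLIP), so the adaptation objective stays stable even when the input is corrupted.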
Problem

Research questions and friction points this paper is trying to address.

hallucination
vision-language models
visual corruptions
distribution shift
test-time degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time training
CLIP guidance
vision-language models
hallucination mitigation
visual corruption robustness
Mriganka Nath
Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
Anurag Das
Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
Jiahao Xie
Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
Bernt Schiele
Professor, Max Planck Institute for Informatics, Saarland University, Saarland Informatics Campus
Computer Vision · Machine Learning · Artificial Intelligence · Autonomous Driving · Scene Understanding