AI Summary
Current vision-language models (VLMs) rely heavily on large-scale, manually annotated image-text pairs, incurring prohibitive data curation costs. To address this, we propose Supervision-free Visual Projection (SVP), a fully unsupervised framework for vision-language alignment that requires neither paired annotations nor preference supervision. SVP leverages self-generated captions and pre-trained visual grounding models (e.g., GLIP, OWL-ViT) as a source of implicit grounding feedback, combining implicit gradient guidance with multi-task co-optimization to elicit the alignment capabilities latent in the VLM. Experiments show that SVP improves captioning quality by 14% on average, boosts object recall by up to 12%, and substantially reduces hallucination rates. Notably, compact SVP models match the hallucination suppression of models five times larger, and models with initially weak referring capability more than double their performance, approaching that of models twice their size.
Abstract
Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Supervision-free Visual Projection), a novel framework that enhances vision-language alignment without relying on curated data or preference annotation. SVP leverages self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, an up to 12% increase in object recall, and a substantial reduction in hallucination rates. Notably, a small VLM using SVP achieves hallucination reductions comparable to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.
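To make the self-captioning-plus-grounding-feedback idea concrete, the sketch below shows one plausible reading of the loop described above: the VLM captions its own images, a pre-trained grounding model checks which mentioned objects it can actually localize, and that score decides which self-generated captions are retained as training targets. All function names (`generate_caption`, `extract_objects`, `detect`), the scoring rule, and the keep threshold are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an SVP-style grounding-feedback loop (assumptions noted above).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GroundedCaption:
    image_id: str
    caption: str
    grounding_score: float  # fraction of mentioned objects the grounder can find


def grounding_feedback(
    caption: str,
    image_id: str,
    extract_objects: Callable[[str], List[str]],  # e.g. noun-phrase extraction (assumed)
    detect: Callable[[str, str], bool],           # wrapper around a grounding model (assumed)
) -> float:
    """Score a self-generated caption by how many of its objects are grounded in the image."""
    objects = extract_objects(caption)
    if not objects:
        return 0.0
    hits = sum(detect(image_id, obj) for obj in objects)
    return hits / len(objects)


def build_self_training_set(
    image_ids: List[str],
    generate_caption: Callable[[str], str],       # the VLM captioning its own inputs
    extract_objects: Callable[[str], List[str]],
    detect: Callable[[str, str], bool],
    keep_threshold: float = 0.75,                 # assumed filtering threshold
) -> List[GroundedCaption]:
    """Keep only self-generated captions whose objects the grounding model verifies."""
    kept = []
    for image_id in image_ids:
        caption = generate_caption(image_id)
        score = grounding_feedback(caption, image_id, extract_objects, detect)
        if score >= keep_threshold:
            kept.append(GroundedCaption(image_id, caption, score))
    return kept


if __name__ == "__main__":
    # Toy stand-ins so the loop runs end to end; real usage would plug in a VLM for
    # captioning and a grounding model such as GLIP or OWL-ViT for detection.
    captions = {"img_0": "a dog on a sofa", "img_1": "a cat riding a unicorn"}
    present = {("img_0", "dog"), ("img_0", "sofa"), ("img_1", "cat")}
    vocab = {"dog", "sofa", "cat", "unicorn"}

    data = build_self_training_set(
        image_ids=list(captions),
        generate_caption=lambda i: captions[i],
        extract_objects=lambda c: [w for w in c.split() if w in vocab],
        detect=lambda i, obj: (i, obj) in present,
    )
    for item in data:
        print(item)  # only the fully grounded caption for img_0 survives the filter
```

In this toy run the caption for `img_1` mentions an ungrounded object ("unicorn"), so its score falls below the threshold and it is dropped, which is the kind of hallucination filtering the feedback mechanism is meant to provide.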