AI Summary
Current vision-language models (VLMs) rely heavily on large-scale, manually annotated image-text pairs, incurring prohibitive data curation costs. To address this, we propose Supervision-free Visual Projection (SVP), a fully unsupervised framework for vision-language alignment that requires neither paired annotations nor preference supervision. SVP leverages self-generated captions and pre-trained visual grounding models (e.g., GLIP, OWL-ViT) as a source of implicit grounding feedback, combining implicit gradient guidance with multi-task co-optimization to elicit the alignment capabilities latent in the VLM. Experiments show that SVP improves captioning quality by 14% on average, boosts object recall by up to 12%, and substantially reduces hallucination rates. Notably, compact SVP models match the hallucination suppression of models five times larger, and models with initially weak referring capability more than double their performance, approaching that of models twice their size.
Abstract
Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Supervision-free Visual Projection), a novel framework that enhances vision-language alignment without relying on curated data or preference annotation. SVP leverages self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, an up to 12% increase in object recall, and a substantial reduction in hallucination rates. Notably, a small VLM using SVP achieves hallucination reductions comparable to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.
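To make the self-captioning-plus-grounding-feedback idea concrete, the sketch below shows one plausible reading of the loop described above: the VLM captions its own images, a pre-trained grounding model checks which mentioned objects it can actually localize, and that score decides which self-generated captions are retained as training targets. All function names (`generate_caption`, `extract_objects`, `detect`), the scoring rule, and the keep threshold are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an SVP-style grounding-feedback loop (assumptions noted above).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GroundedCaption:
    image_id: str
    caption: str
    grounding_score: float  # fraction of mentioned objects the grounder can find


def grounding_feedback(
    caption: str,
    image_id: str,
    extract_objects: Callable[[str], List[str]],  # e.g. noun-phrase extraction (assumed)
    detect: Callable[[str, str], bool],           # wrapper around a grounding model (assumed)
) -> float:
    """Score a self-generated caption by how many of its objects are grounded in the image."""
    objects = extract_objects(caption)
    if not objects:
        return 0.0
    hits = sum(detect(image_id, obj) for obj in objects)
    return hits / len(objects)


def build_self_training_set(
    image_ids: List[str],
    generate_caption: Callable[[str], str],       # the VLM captioning its own inputs
    extract_objects: Callable[[str], List[str]],
    detect: Callable[[str, str], bool],
    keep_threshold: float = 0.75,                 # assumed filtering threshold
) -> List[GroundedCaption]:
    """Keep only self-generated captions whose objects the grounding model verifies."""
    kept = []
    for image_id in image_ids:
        caption = generate_caption(image_id)
        score = grounding_feedback(caption, image_id, extract_objects, detect)
        if score >= keep_threshold:
            kept.append(GroundedCaption(image_id, caption, score))
    return kept


if __name__ == "__main__":
    # Toy stand-ins so the loop runs end to end; real usage would plug in a VLM for
    # captioning and a grounding model such as GLIP or OWL-ViT for detection.
    captions = {"img_0": "a dog on a sofa", "img_1": "a cat riding a unicorn"}
    present = {("img_0", "dog"), ("img_0", "sofa"), ("img_1", "cat")}
    vocab = {"dog", "sofa", "cat", "unicorn"}

    data = build_self_training_set(
        image_ids=list(captions),
        generate_caption=lambda i: captions[i],
        extract_objects=lambda c: [w for w in c.split() if w in vocab],
        detect=lambda i, obj: (i, obj) in present,
    )
    for item in data:
        print(item)  # only the fully grounded caption for img_0 survives the filter
```

In this toy run the caption for `img_1` mentions an ungrounded object ("unicorn"), so its score falls below the threshold and it is dropped, which is the kind of hallucination filtering the feedback mechanism is meant to provide.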