Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models like CLIP excel at cross-modal alignment but suffer from limited fine-grained visual perception, hindering downstream multimodal large language models (MLLMs). To address this, we propose a kernel-based unsupervised visual embedding alignment method that aligns the CLIP visual encoder with the high-detail-aware DINOv2 encoder, while keeping CLIP’s text encoder frozen—requiring no textual supervision or image-text pairs. Our approach introduces a radial basis function (RBF) kernel-driven alignment framework in embedding space, enabling efficient stochastic optimization. Experiments demonstrate substantial improvements on zero-shot object recognition, fine-grained spatial reasoning, and localization tasks. When integrated into MLLMs such as LLaVA, our aligned visual encoder consistently enhances downstream multimodal understanding performance across diverse benchmarks, validating its generalizability and effectiveness without compromising CLIP’s pretrained linguistic capabilities.
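The summary does not spell out the objective itself; the sketch below shows one plausible instantiation, using a batchwise RBF-kernel maximum mean discrepancy (MMD) between CLIP and DINOv2 embeddings as an illustrative stand-in for the paper's kernel-driven alignment loss. The function names, the bandwidth `sigma`, and the MMD formulation are assumptions, not the authors' code.

```python
import torch


def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise RBF kernel k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq_dists = torch.cdist(x, y, p=2).pow(2)      # (n, m) squared Euclidean distances
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd_alignment_loss(clip_emb: torch.Tensor,
                       dino_emb: torch.Tensor,
                       sigma: float = 1.0) -> torch.Tensor:
    """Biased minibatch estimate of MMD^2 between the two embedding batches.

    Both inputs are (batch, dim); since CLIP and DINOv2 use different widths,
    a projection to a shared dimension is assumed to happen upstream.
    """
    k_cc = rbf_kernel(clip_emb, clip_emb, sigma).mean()
    k_dd = rbf_kernel(dino_emb, dino_emb, sigma).mean()
    k_cd = rbf_kernel(clip_emb, dino_emb, sigma).mean()
    return k_cc + k_dd - 2.0 * k_cd
```

Because the estimate is computed per minibatch, it plugs directly into standard stochastic optimizers, which is consistent with the summary's claim of efficient stochastic optimization.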

📝 Abstract
Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance.
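As a companion to the loss sketch above, here is a hedged outline of the image-only alignment fine-tuning the abstract describes: only CLIP's visual encoder is updated, DINOv2 serves as a frozen target, and no text or image-text pairs are involved. `clip_visual`, `dino`, and `image_loader` are placeholder objects, and `mmd_alignment_loss` is the illustrative function from the previous sketch, not the paper's actual objective.

```python
import torch


def align_visual_encoder(clip_visual, dino, image_loader,
                         epochs: int = 1, lr: float = 1e-5, sigma: float = 1.0):
    """Image-only alignment fine-tuning; CLIP's text encoder is never touched."""
    dino.eval()                                    # DINOv2 stays a frozen target
    for p in dino.parameters():
        p.requires_grad_(False)
    clip_visual.train()
    opt = torch.optim.AdamW(clip_visual.parameters(), lr=lr)
    for _ in range(epochs):
        for images in image_loader:                # raw images, no captions
            with torch.no_grad():
                target = dino(images)              # fine-detail reference embeddings
            pred = clip_visual(images)             # embeddings being aligned
            loss = mmd_alignment_loss(pred, target, sigma)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clip_visual
```

Freezing the text encoder while fine-tuning only the visual side is what lets the aligned encoder remain drop-in compatible with CLIP's text embeddings, and hence with MLLMs such as LLaVA.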
Problem

Research questions and friction points this paper is trying to address.

Align CLIP visual representation with DINOv2 for better perception
Enhance fine-grained details in vision-language model embeddings
Improve zero-shot recognition and spatial reasoning in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel-based alignment of CLIP and DINOv2
Enhances fine-grained visual perception
Maintains text embedding compatibility
Authors
Shizhan Gong (The Chinese University of Hong Kong)
Research interests: explainable computer vision, multimodal learning, medical imaging analysis
Yankai Jiang (Shanghai Artificial Intelligence Laboratory, Shanghai, China)
Qi Dou (Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China)
Farzan Farnia (Assistant Professor, Chinese University of Hong Kong)
Research interests: Machine Learning, Optimization, Information Theory