Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing open-vocabulary segmentation (OVS) methods face a fundamental trade-off: vision-language models such as CLIP achieve strong semantic alignment but only coarse spatial localization, because they match image and text features globally, whereas self-supervised models such as DINOv2 provide fine-grained visual representations yet lack linguistic grounding. To resolve this, the authors propose a fully unsupervised cross-modal alignment framework that requires no fine-tuning of either backbone. The approach couples CLIP's text embeddings with DINOv2's local patch features, guided by DINOv2's attention maps, and introduces a learnable mapping function that jointly aligns semantics and spatial structure. Crucially, the method operates without category priors or supervised annotations. Extensive experiments demonstrate substantial improvements in segmentation accuracy, naturalness of the resulting masks, robustness to noise, and foreground-background discrimination, achieving state-of-the-art performance across multiple unsupervised OVS benchmarks.

📝 Abstract
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
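The inference-time idea described in the abstract — project CLIP's text embeddings into DINOv2's patch-feature space through a learned mapping, then label each patch by its most similar concept — can be sketched minimally as below. This is an illustrative stand-in, not Talk2DINO's actual implementation: the random tensors substitute for frozen backbone outputs, and the single linear projection `W`, the dimensions, and the helper names are all assumptions.

```python
import numpy as np

# Hypothetical dimensions: CLIP text embeddings (512-d) mapped into
# DINOv2's patch-feature space (768-d). The paper's learned mapping
# may be more elaborate than this linear stand-in.
D_CLIP, D_DINO = 512, 768
rng = np.random.default_rng(0)

# Learned mapping function, sketched as a single projection matrix W.
W = rng.standard_normal((D_CLIP, D_DINO)) * 0.02

def map_text(text_embs):
    """Project CLIP text embeddings into DINOv2 patch space."""
    return text_embs @ W

def segment(patch_feats, text_embs):
    """Assign each frozen DINOv2 patch to its most similar concept.

    patch_feats: (H*W, D_DINO) patch features from frozen DINOv2
    text_embs:   (C, D_CLIP)   embeddings of C free-form concepts from frozen CLIP
    Returns a (H*W,) array of per-patch class indices.
    """
    mapped = map_text(text_embs)                                   # (C, D_DINO)
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = mapped / np.linalg.norm(mapped, axis=-1, keepdims=True)
    sims = p @ t.T                                                 # (H*W, C) cosine similarity
    return sims.argmax(axis=-1)

# Toy usage with random stand-ins for the frozen backbone outputs.
patches = rng.standard_normal((196, D_DINO))   # e.g. a 14x14 patch grid
texts = rng.standard_normal((3, D_CLIP))       # three free-form concepts
labels = segment(patches, texts)
print(labels.shape)  # (196,)
```

Because both backbones stay frozen, only the mapping is trained, which is what makes the approach cheap and annotation-free.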
Problem

Research questions and friction points this paper is trying to address.

Bridging self-supervised vision and language for segmentation
Improving spatial localization in open-vocabulary segmentation
Combining DINOv2's accuracy with CLIP's language understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines DINOv2 spatial accuracy with CLIP language understanding
Aligns CLIP text to DINOv2 patches via learned mapping
Uses DINOv2 attention maps for selective patch-text alignment
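The attention-guided alignment in the bullets above can be sketched as follows: DINOv2's attention over patches weights a pooled visual summary, whose similarity to the mapped text embedding serves as the training signal. The pooling formulation, names, and shapes here are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_pooled_alignment(patch_feats, attn, mapped_text):
    """Cosine similarity between attention-pooled patches and mapped text.

    patch_feats: (N, D) frozen DINOv2 patch features
    attn:        (N,)   attention weights over patches (nonnegative, sums to 1),
                        e.g. derived from DINOv2's self-attention maps
    mapped_text: (D,)   CLIP text embedding after the learned mapping
    Returns a scalar in [-1, 1]; training would maximize it for matching pairs.
    """
    pooled = attn @ patch_feats                       # (D,) attention-weighted summary
    pooled = pooled / np.linalg.norm(pooled)
    mapped_text = mapped_text / np.linalg.norm(mapped_text)
    return float(pooled @ mapped_text)

# Toy example with random stand-ins for attention maps and features.
N, D = 196, 768
patches = rng.standard_normal((N, D))
attn = np.exp(rng.standard_normal(N))
attn /= attn.sum()                                    # softmax-like weights over patches
text = rng.standard_normal(D)
score = attention_pooled_alignment(patches, attn, text)
print(round(score, 3))
```

Weighting by attention lets the text embedding align selectively with the patches DINOv2 deems salient, rather than with a uniform average over the whole image.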