Near, far: Patch-ordering enhances vision foundation models' scene understanding

📅 2024-08-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To improve the scene understanding of vision foundation models, this paper proposes NeCo (Patch Neighbor Consistency), a self-supervised training loss that applies differentiable sorting to dense patch features. Within a student-teacher framework bootstrapped from pretrained representations such as DINOv2-registers, NeCo enforces consistency in the ordering of patch-level nearest neighbors relative to reference patches, replacing the binary 'attract'/'repel' signal of contrastive losses with a finer-grained one. This dense post-pretraining takes only 19 hours on a single GPU. NeCo achieves +5.5% and +6.0% mIoU gains on ADE20K and Pascal VOC, respectively, for non-parametric in-context semantic segmentation; +7.2% and +5.7% improvements on COCO-Things and COCO-Stuff for linear segmentation evaluation; and more than +1.5% improvement in multi-view consistency on SPair-71k.

📝 Abstract

We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest neighbor consistency across a student and teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. This method generates high-quality dense feature encoders and establishes several new state-of-the-art results, such as +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20K and Pascal VOC, +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff, and an improvement of more than 1.5% in multi-view consistency (3D understanding) on SPair-71k.
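The idea of matching neighbor orderings, as opposed to a binary attract/repel signal, can be illustrated with a minimal sketch. It is not the authors' implementation: the pairwise-sigmoid soft rank is a simple stand-in for the differentiable sorting operator mentioned in the abstract, the squared-difference loss is an assumption, and all function names (`cosine`, `soft_ranks`, `neighbor_order_loss`) and the `temperature` parameter are hypothetical. Patches are plain lists of floats so the example runs with the standard library alone.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def soft_ranks(scores, temperature=0.1):
    """Smooth rank of each score (rank 1 = most similar).

    Assumed relaxation: rank_i = 1 + sum_{j != i} sigmoid((s_j - s_i) / T).
    As T shrinks, this approaches the hard rank.
    """
    ranks = []
    for i, s_i in enumerate(scores):
        rank = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                rank += 1.0 / (1.0 + math.exp(-(s_j - s_i) / temperature))
        ranks.append(rank)
    return ranks


def neighbor_order_loss(student_patches, teacher_patches,
                        student_ref, teacher_ref, temperature=0.1):
    """Mean squared gap between student and teacher soft neighbor ranks.

    Each model ranks its own patches by similarity to its reference patch;
    the loss is zero when both models order the neighbors identically.
    """
    s_scores = [cosine(student_ref, p) for p in student_patches]
    t_scores = [cosine(teacher_ref, p) for p in teacher_patches]
    s_ranks = soft_ranks(s_scores, temperature)
    t_ranks = soft_ranks(t_scores, temperature)
    return sum((a - b) ** 2 for a, b in zip(s_ranks, t_ranks)) / len(s_ranks)


if __name__ == "__main__":
    ref = [1.0, 0.0]
    teacher = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [0.8, 0.6], [1.0, 0.0]]  # nearest and farthest exchanged
    print(neighbor_order_loss(teacher, teacher, ref, ref))  # same ordering
    print(neighbor_order_loss(swapped, teacher, ref, ref))  # reversed ordering
```

Unlike a binary contrastive signal, the loss in this sketch grows with how far each patch sits from its teacher-assigned position in the ordering, so a patch that moves from nearest to farthest is penalized more than one that shifts by a single place.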

Problem

Research questions and friction points this paper is trying to address.

Limited scene understanding in vision foundation models

Binary 'attract'/'repel' learning signals in contrastive self-supervised losses

Quality of dense feature encoders

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised patch neighbor consistency

Differentiable sorting on pretrained representations

High-quality dense feature encoders
