Near, far: Patch-ordering enhances vision foundation models' scene understanding

📅 2024-08-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address insufficient scene understanding by vision foundation models in non-parametric semantic segmentation, linear probing, and multi-view 3D understanding, this paper proposes NeCo (Patch Neighbor Consistency), the first dense neighborhood consistency method in self-supervised learning to incorporate differentiable sorting. Within a student-teacher framework and starting from pretrained DINOv2 features, NeCo enforces consistency in the relative ordering of patch-level nearest neighbors. This replaces the binary attract/repel signal of conventional contrastive losses with a finer-grained ordering signal and enables efficient dense post-pretraining without additional parameters. NeCo achieves +5.5% and +6.0% mIoU gains on ADE20K and Pascal VOC, respectively, for non-parametric in-context semantic segmentation; +7.2% and +5.7% improvements in linear segmentation evaluation on COCO-Things and COCO-Stuff; and an improvement of more than 1.5% in multi-view consistency on SPair-71k. These results indicate a substantial improvement in fine-grained scene understanding.

📝 Abstract
We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest neighbor consistency across a student and teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method applies differentiable sorting on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. The method generates high-quality dense feature encoders and establishes several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20K and Pascal VOC, +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and COCO-Stuff, and an improvement of more than 1.5% in multi-view consistency, a measure of 3D understanding, on SPair-71k.
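
To make the idea of "sorting dense features relative to reference patches" more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It swaps in a simple pairwise-sigmoid soft rank as a stand-in for the differentiable sorting the abstract mentions, and it uses a squared difference between student and teacher ranks as an assumed loss. All function names (`cosine`, `soft_ranks`, `neighbor_order_loss`) and the temperature `tau` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_ranks(scores, tau=0.1):
    """Smooth rank of each score, where rank 1 is the most similar.

    Assumption: a pairwise-sigmoid relaxation, used here only as a
    stand-in for a differentiable sorting operator. As tau shrinks,
    the soft ranks approach the hard ranks 1..n.
    """
    ranks = []
    for i, s_i in enumerate(scores):
        rank = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                # Adds about 1 when patch j is more similar than patch i.
                rank += 1.0 / (1.0 + math.exp(-(s_j - s_i) / tau))
        ranks.append(rank)
    return ranks


def neighbor_order_loss(student_patches, teacher_patches,
                        student_ref, teacher_ref, tau=0.1):
    """Mean squared gap between student and teacher neighbor orderings.

    Hypothetical loss for illustration: each model ranks its own patch
    features by similarity to its own reference patch feature, and the
    two soft rankings are compared position by position.
    """
    s_scores = [cosine(student_ref, p) for p in student_patches]
    t_scores = [cosine(teacher_ref, p) for p in teacher_patches]
    s_ranks = soft_ranks(s_scores, tau)
    t_ranks = soft_ranks(t_scores, tau)
    return sum((a - b) ** 2 for a, b in zip(s_ranks, t_ranks)) / len(s_ranks)
```

In this sketch the loss is zero when the student orders the patches around the reference exactly as the teacher does, and it grows as the two orderings diverge. Unlike a binary attract/repel signal, it depends on where every patch sits in the ordering, which is the finer-grained signal the abstract describes.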
Problem

Research questions and friction points this paper is trying to address.

Enhances scene understanding
Self-supervised training loss
Improves dense feature encoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised patch neighbor consistency
differentiable sorting on pretrained representations
high-quality dense feature encoders