🤖 AI Summary
This work addresses the limited cross-modal understanding of existing pretrained visual encoders such as DINOv2, which stems from poor alignment of feature representations across modalities like RGB, depth, and segmentation. To overcome this, the authors propose an "omnivorous" visual encoder trained with a dual objective: first, maximizing feature consistency across multimodal inputs of the same scene, and second, distilling knowledge from a frozen DINOv2 teacher to anchor the student in a unified, modality-agnostic semantic space. This yields, for the first time, a single encoder that produces consistent and semantically rich features across diverse input modalities. The resulting model preserves DINOv2's original discriminative performance while substantially improving cross-modal feature similarity, enabling truly modality-invariant scene representations.
📝 Abstract
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, to distill knowledge by anchoring the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
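To make the dual objective concrete, here is a minimal NumPy sketch of how the two loss terms could be combined. The function names, the cosine-based form of each term, and the weighting factor `lam` are all assumptions for illustration; the paper's exact losses and weighting may differ.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two batches of feature vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def dual_objective(f_rgb, f_depth, f_teacher, lam=1.0):
    """Hypothetical dual loss for the omnivorous encoder.

    f_rgb, f_depth : student features for the RGB image and the depth
                     map of the same scene, shape (batch, dim).
    f_teacher      : frozen DINOv2 teacher features for the RGB image.
    lam            : assumed weight balancing the two terms.
    """
    # 1) Alignment: pull the two modalities' features together.
    align_loss = np.mean(1.0 - cosine_sim(f_rgb, f_depth))
    # 2) Distillation: anchor the student to the frozen teacher's space.
    distill_loss = np.mean(1.0 - cosine_sim(f_rgb, f_teacher))
    return align_loss + lam * distill_loss
```

When the student's RGB and depth features coincide and match the teacher, both terms vanish; any cross-modal or student-teacher mismatch raises the loss, which is the behavior the dual objective is designed to penalize.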