DINOv3

📅 2025-08-13
📈 Citations: 0
✹ Influential: 0
đŸ€– AI Summary
This work addresses the construction of general-purpose vision foundation models, aiming to eliminate reliance on manual annotations while enabling representation learning across domains (natural and aerial imagery) and scalable training over large datasets and model sizes. To mitigate the degradation of dense feature maps during long training schedules, the authors propose a Gram anchoring mechanism; they further introduce post-hoc adaptation strategies that improve flexibility with respect to resolution, model size, and text–image alignment. The method builds on a self-supervised learning framework, combining large-scale curation of heterogeneous data, feature-map regularization, and lightweight post-processing. The resulting model achieves state-of-the-art performance across classification, detection, and segmentation tasks without task-specific fine-tuning, outperforming domain-specialized methods, and its dense feature representations are of substantially higher quality than those of existing self-supervised and weakly supervised foundation models.
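The summary describes Gram anchoring as a regularizer that keeps dense feature maps from degrading over long training runs. A minimal sketch of one plausible reading, assuming the loss penalizes drift between the Gram matrix (pairwise patch similarities) of the current features and that of a reference "anchor" network; the function name, L2 normalization, and mean-squared reduction are illustrative assumptions, not the report's exact formulation:

```python
import numpy as np

def gram_anchor_loss(student_feats: np.ndarray, anchor_feats: np.ndarray) -> float:
    """Penalize drift of the student's patch-similarity structure.

    Both inputs are (num_patches, dim) arrays of dense patch features.
    Rows are L2-normalized so Gram entries are cosine similarities.
    Hypothetical sketch; the paper's actual loss may differ in details.
    """
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    a = anchor_feats / np.linalg.norm(anchor_feats, axis=1, keepdims=True)
    gram_s = s @ s.T  # (num_patches, num_patches) patch-to-patch similarities
    gram_a = a @ a.T
    # Mean squared difference between the two similarity structures:
    # zero when the student preserves the anchor's patch relationships.
    return float(np.mean((gram_s - gram_a) ** 2))
```

Note that the loss constrains only relative similarities between patches, not the feature values themselves, so the student remains free to improve its global representations while the local structure of its dense features is held stable.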

📝 Abstract
Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
Problem

Research questions and friction points this paper is trying to address.

Eliminate manual data annotation with self-supervised learning
Address dense feature map degradation during long training
Enhance model flexibility in resolution and text alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling dataset and model size effectively
Introducing Gram anchoring for stable training
Enhancing model flexibility post-training
Oriane Siméoni
Meta FAIR
Computer vision
Huy V. Vo
Researcher at Meta FAIR
Computer vision, machine learning
Maximilian Seitzer
Meta FAIR
Federico Baldassarre
Meta AI postdoc
Self-supervised learning, video world models, explainability and reasoning
Maxime Oquab
Facebook AI Research
Deep learning, computer vision
Cijo Jose
Meta AI Research
Vasil Khalidov
Meta AI
Computer vision, self-supervised learning, generative AI
Marc Szafraniec
Research Engineer, Facebook AI Research
Artificial intelligence, deep learning
Seungeun Yi
Meta AI Research
Michaël Ramamonjisoa
Meta AI, FAIR
Computer vision, machine learning, image processing, deep learning
Francisco Massa
Research Engineer at Facebook AI Research
Artificial intelligence, computer vision, machine learning
Daniel Haziza
Facebook AI Research (FAIR)
Luca Wehrstedt
Facebook
Jianyuan Wang
Oxford Visual Geometry Group & FAIR
Timothée Darcet
PhD student, Meta AI and Inria
Deep learning, computer vision
Théo Moutakanni
Meta - FAIR, Université Paris-Saclay - CentraleSupélec - MICS
Deep learning
Leonel Sentana
Meta AI Research
Claire Roberts
Meta AI Research
Andrea Vedaldi
University of Oxford
Computer vision, machine learning
Jamie Tolan
Meta AI Research
John Brandt
Meta AI Research
Camille Couprie
Research scientist at Facebook AI Research
Optimization, graphs, image processing, computer vision, machine learning
Julien Mairal
Inria - Univ. Grenoble Alpes
Machine learning, artificial intelligence, optimization, computer vision, image processing
Hervé Jégou
FAIR
Indexing, machine learning, artificial intelligence, computer vision
Patrick Labatut
Meta
Computer vision, computer graphics, machine learning