DINOv3-Diffusion Policy: Self-Supervised Large Visual Model for Visuomotor Diffusion Policy Learning

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the effectiveness of the large-scale self-supervised vision backbone DINOv3 in diffusion-based robotic manipulation policy learning. We systematically compare its performance—when trained from scratch, frozen, or fine-tuned—against supervised ResNet-18. Notably, this is the first study to integrate a purely self-supervised pre-trained model (trained without ImageNet labels) into a diffusion policy framework: DINOv3 serves as the visual encoder, coupled with a FiLM-conditioned diffusion network for end-to-end visuomotor policy learning. Experiments show an absolute 10% success rate improvement on the Can task; comparable performance on Lift, PushT, and Square tasks; and significantly enhanced robustness. The core contribution is the empirical validation that high-quality self-supervised representations can effectively replace supervised pretraining, offering superior sample efficiency and generalization. This advances embodied intelligence by reducing reliance on labeled data and establishing a new paradigm for low-annotation-demand robotic learning.
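The summary above describes coupling the DINOv3 encoder to the diffusion network via FiLM conditioning, i.e. feature-wise affine modulation of the denoiser's hidden activations by the visual embedding. A minimal NumPy sketch of that mechanism is below; the dimensions (a 384-d embedding, a 64-channel hidden layer) and the weight initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, gamma, beta):
    """FiLM: feature-wise affine modulation, y = gamma * x + beta."""
    return gamma * features + beta

# Hypothetical dimensions: a 384-d image embedding (e.g. a ViT-S-sized
# DINOv3 output) conditions a 64-channel hidden layer of the denoiser.
obs_dim, hidden_dim = 384, 64

obs_embedding = rng.standard_normal(obs_dim)  # stand-in for the encoder output
W_gamma = rng.standard_normal((hidden_dim, obs_dim)) * 0.01
W_beta = rng.standard_normal((hidden_dim, obs_dim)) * 0.01

# Linear heads map the embedding to per-channel scale and shift;
# gamma starts near 1 so modulation begins close to the identity.
gamma = 1.0 + W_gamma @ obs_embedding
beta = W_beta @ obs_embedding

hidden = rng.standard_normal(hidden_dim)  # activations inside the denoiser
modulated = film(hidden, gamma, beta)
print(modulated.shape)  # (64,)
```

In a full diffusion policy the same modulation is typically applied at every residual block of the noise-prediction network, with the conditioning heads learned jointly with the policy.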

📝 Abstract
This paper evaluates DINOv3, a recent large-scale self-supervised vision backbone, for visuomotor diffusion policy learning in robotic manipulation. We investigate whether a purely self-supervised encoder can match or surpass conventional supervised ImageNet-pretrained backbones (e.g., ResNet-18) under three regimes: training from scratch, frozen, and finetuned. Across four benchmark tasks (Push-T, Lift, Can, Square) using a unified FiLM-conditioned diffusion policy, we find that (i) finetuned DINOv3 matches or exceeds ResNet-18 on several tasks, (ii) frozen DINOv3 remains competitive, indicating strong transferable priors, and (iii) self-supervised features improve sample efficiency and robustness. These results support self-supervised large visual models as effective, generalizable perceptual front-ends for action diffusion policies, motivating further exploration of scalable label-free pretraining in robotic manipulation. Compared to using ResNet-18 as a backbone, our approach with DINOv3 achieves up to a 10% absolute increase in test-time success rate on challenging tasks such as Can, and on-par performance on tasks like Lift, Push-T, and Square.
Problem

Research questions and friction points this paper is trying to address.

Evaluating self-supervised DINOv3 for robotic visuomotor policy learning
Comparing self-supervised versus supervised ImageNet backbones for manipulation
Assessing transferable visual priors for diffusion policy generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses self-supervised DINOv3 as visual backbone
Applies diffusion policy learning for robot manipulation
Achieves improved sample efficiency and robustness
ThankGod Egbe
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, M15 6BH, UK
Peng Wang
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, M15 6BH, UK
Zhihao Guo
Manchester Metropolitan University
Zidong Chen
Department of Computing, Imperial College London, London, SW7 2AZ, UK