X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of data-efficient robot learning, where large vision models are difficult to train under data scarcity and small CNNs suffer from limited representational capacity. The authors propose an offline cross-architecture knowledge distillation approach that transfers the visual representations of a frozen DINOv2 teacher to a lightweight ResNet-18 student, using ImageNet as the distillation dataset. The distilled encoder is then jointly fine-tuned end-to-end with a diffusion policy head, without requiring 3D point clouds or large-scale vision-language models. The method significantly improves data efficiency and achieves state-of-the-art performance across 34 simulated tasks and 5 real-world manipulation tasks, outperforming both ResNet encoders trained from scratch and fine-tuned DINOv2 encoders.

📝 Abstract
Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
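The core mechanism the abstract describes, aligning a compact student encoder's features with those of a frozen teacher, can be sketched as a feature-matching objective. The sketch below is illustrative, not the paper's training code: the feature dimensions match the real models (ResNet-18 produces 512-d global features, DINOv2 ViT-B/14 produces 768-d embeddings), but the features are random placeholders, and the linear projection head and single gradient step are assumptions about how such a loss is typically optimized.

```python
# Illustrative sketch of a cross-architecture feature-distillation objective.
# Feature dims mirror the real encoders (ResNet-18: 512-d, DINOv2 ViT-B/14: 768-d),
# but the vectors here are random stand-ins; the projection head and optimizer
# step are hypothetical, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(512)   # stand-in for a ResNet-18 student feature
t = rng.standard_normal(768)   # stand-in for a frozen DINOv2 teacher feature

# Learnable projection head mapping student space into teacher space.
W = rng.standard_normal((768, 512)) * 0.01

def distill_loss(W):
    # Mean-squared error between projected student and teacher features.
    r = W @ s - t
    return np.mean(r ** 2), r

loss_before, r = distill_loss(W)
# One gradient-descent step on W: d/dW mean(r^2) = (2/768) * outer(r, s).
W -= 0.1 * (2.0 / 768) * np.outer(r, s)
loss_after, _ = distill_loss(W)
assert loss_after < loss_before  # the alignment objective decreases
```

In practice the teacher stays frozen while gradients flow into both the projection head and the student backbone; after distillation the projection is discarded and the student is fine-tuned jointly with the policy head.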
Problem

Research questions and friction points this paper is trying to address.

visuomotor learning
data scarcity
vision transformers
knowledge distillation
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-architecture distillation
vision transformers
knowledge distillation
visuomotor learning
data-efficient robotics
Maanping Shao
Tsinghua University, Beijing 100084, China
Feihong Zhang
Tsinghua University, Beijing 100084, China
Gu Zhang
Tsinghua University
Robotics, Robot Learning
Baiye Cheng
Huazhong University of Science and Technology, Wuhan 430074, China
Zhengrong Xue
IIIS, Tsinghua University
Robot Learning, Robotic Manipulation
Huazhe Xu
Tsinghua University
Embodied AI, Reinforcement Learning, Computer Vision, Deep Learning