🤖 AI Summary
This work addresses the challenge of data-efficient robot learning: large vision models are difficult to train under data scarcity, while small CNNs suffer from limited representational capacity. The authors propose an offline cross-architecture knowledge distillation approach that transfers the powerful visual representations of a frozen DINOv2 teacher to a lightweight ResNet-18 student, using the general-purpose ImageNet dataset as the distillation corpus. The distilled encoder is then jointly fine-tuned end-to-end with a diffusion policy head, without requiring 3D point clouds or large vision-language models. The method substantially improves data efficiency and achieves state-of-the-art performance across 34 simulated tasks and 5 real-world manipulation tasks, outperforming both ResNet encoders trained from scratch and fine-tuned DINOv2 encoders.
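To make the first stage concrete, below is a minimal PyTorch sketch of offline cross-architecture distillation as described above. The `dinov2_vitb14` teacher variant, the linear projection head, and the MSE feature-matching loss are illustrative assumptions rather than the paper's confirmed recipe; the key idea is simply regressing student features onto frozen teacher features over ImageNet images.

```python
# Sketch: distill a frozen DINOv2 teacher into a ResNet-18 student on ImageNet.
# Loss, projection head, and teacher size are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torchvision.models as models

teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
teacher.eval().requires_grad_(False)       # teacher stays frozen throughout

student = models.resnet18(weights=None)
student.fc = nn.Identity()                 # expose the 512-d pooled features
proj = nn.Linear(512, 768)                 # align student dim to the ViT-B/14 embedding (assumed)

opt = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-4)

def distill_step(images):
    # images: (B, 3, 224, 224) ImageNet batch; 224 is divisible by the ViT's patch size 14
    with torch.no_grad():
        t_feat = teacher(images)           # (B, 768) CLS embedding from the frozen teacher
    s_feat = proj(student(images))         # (B, 768) projected student features
    loss = nn.functional.mse_loss(s_feat, t_feat)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because this stage runs once, offline, on generic ImageNet data, the resulting ResNet-18 checkpoint can be reused across all downstream manipulation tasks.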
📝 Abstract
Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their substantial data requirements pose a major challenge in the data-scarce settings typical of robot learning, where compact CNNs with strong inductive biases are easier to optimize. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that combines the strengths of both architectures. Our approach performs offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with strong visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill surpasses 3D encoders that use privileged point cloud observations, as well as much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
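The second stage, joint fine-tuning with a diffusion policy head, can be sketched as below. The `NoisePredNet` denoiser, the timestep embedding, and the DDPM schedule hyperparameters are hypothetical placeholders; the sketch only illustrates the standard noise-prediction objective with gradients flowing back into the distilled encoder, not the paper's specific policy architecture.

```python
# Sketch: end-to-end fine-tuning of the distilled encoder with a diffusion
# policy head (DDPM-style noise prediction). Network and schedule are assumed.
import torch
import torch.nn as nn

T = 100                                        # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class NoisePredNet(nn.Module):
    """Hypothetical conditional denoiser: predicts the noise added to an action."""
    def __init__(self, act_dim, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, act_dim))

    def forward(self, noisy_action, obs_feat, t):
        t_emb = t.float().unsqueeze(-1) / T    # crude scalar timestep embedding
        return self.net(torch.cat([noisy_action, obs_feat, t_emb], dim=-1))

def policy_loss(encoder, head, images, actions):
    obs = encoder(images)                      # gradients also update the distilled encoder
    t = torch.randint(0, T, (actions.shape[0],))
    eps = torch.randn_like(actions)
    ab = alphas_bar[t].unsqueeze(-1)
    noisy = ab.sqrt() * actions + (1.0 - ab).sqrt() * eps   # DDPM forward noising
    return nn.functional.mse_loss(head(noisy, obs, t), eps) # predict the injected noise
```

Training both `encoder` and `head` under one optimizer lets task gradients adapt the distilled visual features to the manipulation domain, which is the joint end-to-end fine-tuning the abstract refers to.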