Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control

📅 2026-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing visual backbone networks, which prioritize semantic invariance at the expense of sensitivity to millimeter-scale geometric changes—hindering high-precision closed-loop robotic control. To overcome this, the authors propose manifold distillation to transfer geometric priors from a frozen diffusion model into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN). This approach introduces diffusion-based geometric information into robotic control for the first time, decoupling the source of geometric knowledge from the inference process and effectively mitigating issues of stochasticity, latency, and representation drift. Pretrained on the large-scale DROID dataset, the method significantly outperforms mainstream discriminative baselines in both geometric consistency and control performance, highlighting the critical role of visual representation learning in enabling precise robotic action.

📝 Abstract
We hypothesize that a key bottleneck in generalizable robot manipulation is not solely data scale or policy capacity, but a structural mismatch between current visual backbones and the physical requirements of closed-loop control. While state-of-the-art vision encoders (including those used in VLAs) optimize for semantic invariance to stabilize classification, manipulation typically demands geometric sensitivity: the ability to map millimeter-level pose shifts to predictable feature changes. Their discriminative objective creates a "blind spot" for fine-grained control, whereas generative diffusion models inherently encode geometric dependencies within their latent manifolds, encouraging the preservation of dense multi-scale spatial structure. However, directly deploying stochastic diffusion features for control is hindered by stochastic instability, inference latency, and representation drift during fine-tuning. To bridge this gap, we propose Robot-DIFT, a framework that decouples the source of geometric information from the process of inference via Manifold Distillation. By distilling a frozen diffusion teacher into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN), we retain the rich geometric priors of the generative model while ensuring temporal stability, real-time execution, and robustness against drift. Pretrained on the large-scale DROID dataset, Robot-DIFT demonstrates superior geometric consistency and control performance compared to leading discriminative baselines, supporting the view that how a model learns to see dictates how well it can learn to act.
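The abstract does not spell out the distillation loss, but the idea of transferring a frozen teacher's geometric structure to a deterministic student can be sketched in a few lines. The toy objective below is an assumption, not the paper's actual formulation: it combines direct feature matching with a pairwise-distance term that asks the student to preserve the teacher's feature-space geometry (the names `manifold_distillation_loss` and `alpha` are illustrative).

```python
import numpy as np

def pairwise_dists(feats):
    # feats: (N, D) array, one row of features per patch/view.
    diff = feats[:, None, :] - feats[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def manifold_distillation_loss(student, teacher, alpha=0.5):
    """Toy distillation objective (illustrative, not the paper's loss):
    match the frozen teacher's features directly, and additionally
    preserve the teacher's pairwise feature geometry so that small
    pose shifts map to comparable feature displacements."""
    point_loss = np.mean((student - teacher) ** 2)
    geo_loss = np.mean((pairwise_dists(student) - pairwise_dists(teacher)) ** 2)
    return point_loss + alpha * geo_loss

# A student that reproduces the teacher exactly incurs zero loss.
teacher = np.random.default_rng(0).normal(size=(4, 8))
print(manifold_distillation_loss(teacher.copy(), teacher))  # → 0.0
```

In practice the teacher features would come from a frozen diffusion backbone and the student would be the S2-FPN, trained so that inference never touches the stochastic teacher.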
Problem

Research questions and friction points this paper is trying to address.

geometric consistency
visuomotor control
visual backbone
robot manipulation
feature representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manifold Distillation
Diffusion Features
Geometric Consistency
Visuomotor Control
Spatial-Semantic Feature Pyramid Network