ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

📅 2025-12-11
🤖 AI Summary
To address multimodal collapse and physical inconsistency arising from the heterogeneity between low-frequency global vision and high-frequency local force sensing in contact-rich dexterous manipulation, this paper proposes an end-to-end visual-force diffusion policy. The method jointly models global task planning and closed-loop high-frequency force feedback, eliminating the need for a hierarchical design. Key contributions include: (1) a Structural Slow-Fast Learning mechanism that achieves cross-modal temporal alignment and token fusion via causal asynchronous attention; and (2) virtual-target-based representation regularization, which improves physical consistency by mapping force signals into an implicit action space. Evaluated on multiple contact-rich manipulation tasks, the approach significantly outperforms vision-only and hierarchical baselines, achieving a 12.7% improvement in success rate and a 38% reduction in response latency, while using a simpler, more streamlined training pipeline.

📝 Abstract
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
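The abstract describes Virtual-target-based Representation Regularization only at a high level: an auxiliary objective that maps force feedback into the same space as the action. As a minimal sketch of the general shape of such a loss (not the authors' implementation), the numpy snippet below uses a hypothetical linear head `W` to project force features into action space and regresses them toward a target action; all shapes, names, and the choice of target are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: batch of force-feedback features and target actions.
B, F, A = 4, 12, 7                      # batch, force-feature dim, action dim (assumed)
force_feat = rng.standard_normal((B, F))
target_action = rng.standard_normal((B, A))

# A linear head standing in for the learned mapping from force features into
# the (implicit) action space; in the paper this would be trained jointly.
W = rng.standard_normal((F, A)) * 0.1

def aux_regularizer(force_feat, target_action, W):
    """Auxiliary objective: the force branch predicts a 'virtual target'
    expressed in action space, rather than reconstructing raw force signals."""
    virtual_target = force_feat @ W     # force mapped into action space
    return np.mean((virtual_target - target_action) ** 2)

loss = aux_regularizer(force_feat, target_action, W)
print(loss)
```

The point of the sketch is the supervision signal: the force branch is graded in action space, which the abstract argues is a stronger, physics-grounded signal than raw force prediction.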
Problem

Research questions and friction points this paper is trying to address.

Integrates vision and force sensing for manipulation tasks
Addresses modality collapse in end-to-end multimodal learning
Enables closed-loop force control with visual planning coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual-force diffusion policy with end-to-end network
Structural Slow-Fast Learning using causal attention for asynchronous tokens
Virtual-target-based Representation Regularization for physics-grounded force feedback
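The slow-fast mechanism in the bullets above is only sketched in the abstract. As a rough illustration of the general idea (not the authors' implementation), the following numpy sketch interleaves low-frequency visual tokens with high-frequency force tokens and applies a timestamp-based causal attention mask, so each token may attend only to tokens whose timestamps are not later than its own. The sampling rates, token dimension, and variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # token dimension (illustrative)

# Asynchronous token streams: vision at 10 Hz, force at 100 Hz (assumed rates).
t_vis = np.arange(0.0, 0.5, 0.10)           # 5 visual tokens
t_frc = np.arange(0.0, 0.5, 0.01)           # 50 force tokens
times = np.concatenate([t_vis, t_frc])
tokens = rng.standard_normal((times.size, d))

# Sort by timestamp so the sequence interleaves both modalities.
order = np.argsort(times, kind="stable")
times, tokens = times[order], tokens[order]

# Cross-modal causal mask: token i may attend to token j
# only if j's timestamp is not later than i's.
mask = times[None, :] <= times[:, None]

# Standard scaled dot-product self-attention under that mask.
q = k = v = tokens
scores = q @ k.T / np.sqrt(d)
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
out = weights @ v                           # one fused token per input token
print(out.shape)
```

Because the mask is defined on timestamps rather than positions, fast force tokens can attend to the most recent slow visual token as soon as it arrives, which is the property the paper exploits for closed-loop adjustment at the force frequency.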
Wendi Chen
Ph.D. Student, Shanghai Jiao Tong University
Robot Learning · Embodied AI · Machine Learning
Han Xue
Shanghai Jiao Tong University
Yi Wang
Shanghai Jiao Tong University
Fangyuan Zhou
Shanghai Jiao Tong University
Jun Lv
Shanghai Jiao Tong University
Embodied AI · Robot Learning · Artificial Intelligence
Yang Jin
Shanghai Jiao Tong University
Shirun Tang
Noematrix Ltd.
Chuan Wen
Shanghai Jiao Tong University
Robotics · Machine Learning · Computer Vision
Cewu Lu
Shanghai Jiao Tong University