RealD$^2$iff: Bridging Real-World Gap in Robot Manipulation via Depth Diffusion

📅 2025-11-27
🤖 AI Summary
Real-world robotic manipulation suffers from a sim-to-real gap in depth perception, primarily because simulated depth images lack realistic sensor noise. Method: We propose a novel "clean-to-noisy" reverse diffusion paradigm that synthesizes realistic noisy depth maps directly from noise-free simulation data, without requiring any real-world depth acquisition. The approach employs a hierarchical coarse-to-fine diffusion architecture jointly supervised by Frequency-Guided Supervision (FGS) and Discrepancy-Guided Optimization (DGO), enabling layered modeling of global structural distortions and local fine-grained perturbations. Integrated into a six-stage imitation-learning pipeline, it automatically constructs high-fidelity paired clean-noisy depth datasets. Results: The framework achieves zero-shot sim-to-real transfer on real-robot grasping and manipulation tasks, significantly improving generalization, and provides a scalable, calibration-free solution to the sim-to-real challenge in vision-based depth perception.

📝 Abstract
Robot manipulation in the real world is fundamentally constrained by the visual sim2real gap, where depth observations collected in simulation fail to reflect the complex noise patterns inherent to real sensors. In this work, inspired by the denoising capability of diffusion models, we invert the conventional perspective and propose a clean-to-noisy paradigm that learns to synthesize noisy depth, thereby bridging the visual sim2real gap through purely simulation-driven robotic learning. Building on this idea, we introduce RealD$^2$iff, a hierarchical coarse-to-fine diffusion framework that decomposes depth noise into global structural distortions and fine-grained local perturbations. To enable progressive learning of these components, we further develop two complementary strategies: Frequency-Guided Supervision (FGS) for global structure modeling and Discrepancy-Guided Optimization (DGO) for localized refinement. To integrate RealD$^2$iff seamlessly into imitation learning, we construct a six-stage pipeline. We provide comprehensive empirical and experimental validation demonstrating the effectiveness of this paradigm. RealD$^2$iff enables two key applications: (1) generating real-world-like depth to construct clean-noisy paired datasets without manual sensor data collection, and (2) achieving zero-shot sim2real robot manipulation, substantially improving real-world performance without additional fine-tuning.
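The Frequency-Guided Supervision idea can be illustrated with a minimal sketch (a hypothetical formulation for intuition, not the paper's exact loss): compare predicted and target depth maps in the Fourier domain, keeping only low frequencies, which carry the global structural distortions FGS is meant to supervise.

```python
import numpy as np

def frequency_guided_loss(pred, target, cutoff=0.1):
    """Hypothetical FGS-style loss: L1 distance on the low-frequency
    part of the depth spectrum. Global structural distortions live in
    low frequencies, so high-frequency content is masked out before
    comparison."""
    # 2-D FFT of both depth maps, shifted so DC sits at the center.
    fp = np.fft.fftshift(np.fft.fft2(pred))
    ft = np.fft.fftshift(np.fft.fft2(target))

    # Circular low-pass mask around the spectrum center.
    h, w = pred.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = radius <= cutoff * min(h, w)

    # Mean absolute difference over the masked spectra.
    return np.abs((fp - ft) * mask).mean()

# Toy example: a clean depth ramp vs. a globally warped copy and a
# speckled copy. The warp is a low-frequency error and dominates the
# loss; per-pixel speckle is mostly filtered out by the mask.
clean = np.linspace(0.5, 2.0, 64 * 64).reshape(64, 64)
warped = clean * 1.05 + 0.02
speckled = clean + 0.02 * np.random.default_rng(0).standard_normal((64, 64))
```

This mirrors the claim in the abstract that global structure and local perturbations can be separated by frequency; the cutoff and mask shape here are illustrative choices only.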
Problem

Research questions and friction points this paper is trying to address.

Bridging visual sim2real gap in robot manipulation
Synthesizing noisy depth via clean-to-noisy diffusion paradigm
Enabling zero-shot sim2real transfer without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical diffusion framework synthesizes noisy depth from clean simulation data
Frequency-guided supervision models global structural distortions in depth noise
Discrepancy-guided optimization refines local perturbations for realistic sensor simulation
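Discrepancy-Guided Optimization is described here only at a high level; one plausible sketch of the underlying idea (a hypothetical formulation, not the paper's) is a per-pixel loss reweighted by the current discrepancy map, so training effort concentrates on local regions where the synthesized noise still deviates most from the real sensor pattern.

```python
import numpy as np

def discrepancy_weighted_loss(pred, target, eps=1e-6):
    """Hypothetical DGO-style loss: per-pixel L1 error reweighted by a
    normalized discrepancy map, so poorly modeled local regions
    dominate the training signal (the weight would be treated as a
    constant, i.e. stop-gradient, in practice)."""
    err = np.abs(pred - target)
    # Discrepancy map normalized to sum to 1.
    weight = err / (err.sum() + eps)
    return (weight * err).sum()
```

Under this weighting, an error concentrated in one small region is penalized more heavily than the same total error spread evenly across the image, which matches the stated goal of refining localized perturbations.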
👥 Authors
Xiujian Liang (SII, FDU)
Jiacheng Liu (SII, WU)
Mingyang Sun (SII, WU)
Qichen He (SII, SJTU)
Cewu Lu (SII, SJTU)
Jianhua Sun (Shanghai Jiao Tong University)
Computer Vision · Robot Learning