Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

๐Ÿ“… 2025-09-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current robotic manipulation relies predominantly on RGB vision and generalizes poorly; human interaction, in contrast, leverages intrinsic 3D geometric properties such as distance, size, and shape. Depth cameras can supply this geometry, but their raw outputs suffer from substantial noise and limited accuracy, hindering robust real-world manipulation. To address this, the authors propose Camera Depth Models (CDMs), a plugin for everyday depth cameras that takes RGB images and raw depth maps as input and outputs denoised, metrically accurate depth. A neural data engine synthesizes high-fidelity paired training data in simulation by modeling each camera's noise pattern. On physical robots, CDMs achieve near simulation-level depth accuracy, allowing policies trained on raw simulated depth, without noise injection or real-world fine-tuning, to transfer almost losslessly to challenging manipulation tasks involving diverse, hard-to-grasp objects.
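The summary describes CDMs as a drop-in plugin: RGB image plus raw depth in, denoised metric depth out. The sketch below illustrates only that input/output contract with a naive hole-filling placeholder; the real CDM is a learned network, and all names here (`refine_depth`) are hypothetical, not the paper's API.

```python
import numpy as np

def refine_depth(rgb: np.ndarray, raw_depth: np.ndarray) -> np.ndarray:
    """Stand-in for a Camera Depth Model (CDM).

    A real CDM is a trained network that jointly denoises and completes
    the depth map using the RGB image as guidance. This placeholder only
    fills missing readings (encoded as 0) with the median of the valid
    depths, to show the plugin-style interface: (RGB, raw depth) -> depth.
    """
    depth = raw_depth.astype(np.float32).copy()
    valid = depth > 0          # depth cameras report 0 where they fail
    if valid.any():
        depth[~valid] = np.median(depth[valid])  # naive hole filling
    return depth

# Toy example: a 2x2 depth map with one missing pixel.
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
raw = np.array([[1.0, 0.0],
                [2.0, 3.0]], dtype=np.float32)
out = refine_depth(rgb, raw)   # missing pixel filled with median(1, 2, 3) = 2.0
```

In the paper's pipeline, the refined depth would then feed a manipulation policy trained entirely on simulated depth, which is what makes the sim-to-real transfer work.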

๐Ÿ“ Abstract
Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.
Problem

Research questions and friction points this paper is trying to address.

Improving depth camera accuracy for robot manipulation tasks
Bridging sim-to-real gap in geometric perception for robotics
Enhancing generalization of manipulation policies using 3D information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Camera Depth Models plugin for depth cameras
Neural data engine generates simulation noise data
Sim-to-real generalization without fine-tuning or noise
๐Ÿ”Ž Similar Papers
No similar papers found.