🤖 AI Summary
This work addresses robust object pose estimation and in-hand reorientation for dexterous hands from monocular RGB vision under challenging lighting and dynamic conditions. The authors propose the first visual sim-to-real transfer framework based on 3D Gaussian Splatting (3DGS), which applies physically consistent, pre-rendering augmentations in the Gaussian representation space to generate photorealistic and diverse training data for the pose estimator. The manipulation policy is trained with curriculum reinforcement learning and teacher-student distillation. Perception and control modules can be trained independently on consumer-grade hardware, without multi-view cameras or ray tracing. Experiments show that the pose estimator trained on 3DGS data outperforms estimators trained on conventionally rendered data, and that the system achieves robust reorientation of five diverse objects on a real multi-fingered hand under challenging illumination.
📝 Abstract
In-hand object reorientation requires precise estimation of the object pose to handle complex task dynamics. While RGB sensing offers rich semantic cues for pose tracking, existing solutions rely on multi-camera setups or costly ray tracing. We present a sim-to-real framework for monocular RGB in-hand reorientation that integrates 3D Gaussian Splatting (3DGS) to bridge the visual sim-to-real gap. Our key insight is performing domain randomization in the Gaussian representation space: by applying physically consistent, pre-rendering augmentations to 3D Gaussians, we generate photorealistic, randomized visual data for object pose estimation. The manipulation policy is trained using curriculum-based reinforcement learning with teacher-student distillation, enabling efficient learning of complex behaviors. Importantly, both perception and control models can be trained independently on consumer-grade hardware, eliminating the need for large compute clusters. Experiments show that the pose estimator trained with 3DGS data outperforms those trained using conventional rendering data in challenging visual environments. We validate the system on a physical multi-fingered hand equipped with an RGB camera, demonstrating robust reorientation of five diverse objects even under challenging lighting conditions. Our results highlight Gaussian splatting as a practical path for RGB-only dexterous manipulation. For videos of the hardware deployments and additional supplementary materials, please refer to the project website: https://rffr.leggedrobotics.com/works/viserdex/
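To make the idea of randomizing in the Gaussian representation space more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a simplified, hypothetical layout in which each Gaussian has a center, a 3x3 covariance, and a base RGB color; spherical-harmonic coefficients, opacities, and the renderer itself are omitted. The function name `randomize_gaussians` and all parameter ranges are illustrative assumptions. The sketch applies one random rigid transform to every Gaussian (so the object stays geometrically consistent and the transform can serve as a pose label) and one shared brightness gain and color tint before any rendering would take place.

```python
import numpy as np


def randomize_gaussians(means, covs, colors, rng,
                        max_shift=0.02, gain_range=(0.6, 1.4), tint_std=0.05):
    """Randomize Gaussian attributes before rendering (illustrative sketch).

    means:  (N, 3) Gaussian centers
    covs:   (N, 3, 3) Gaussian covariances
    colors: (N, 3) base RGB values in [0, 1]
    Returns the randomized attributes plus the sampled rotation and translation.
    """
    # Sample one rigid transform: rotation about the z-axis plus a small shift.
    theta = rng.uniform(-np.pi, np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    t = rng.uniform(-max_shift, max_shift, size=3)

    # A Gaussian N(mu, Sigma) under x -> R x + t becomes N(R mu + t, R Sigma R^T).
    new_means = means @ R.T + t
    new_covs = R @ covs @ R.T  # matmul broadcasts over the leading N axis

    # One gain and one per-channel tint shared by all Gaussians, so the
    # appearance change is consistent across the whole object.
    gain = rng.uniform(*gain_range)
    tint = 1.0 + rng.normal(0.0, tint_std, size=3)
    new_colors = np.clip(colors * gain * tint, 0.0, 1.0)

    return new_means, new_covs, new_colors, R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 4
    means = rng.normal(size=(n, 3))
    covs = np.tile(np.diag([0.1, 0.2, 0.3]), (n, 1, 1))
    colors = rng.uniform(size=(n, 3))
    out_means, out_covs, out_colors, R, t = randomize_gaussians(means, covs, colors, rng)
    print(out_means.shape, out_covs.shape, out_colors.shape)
```

Because the randomization happens on the Gaussian parameters rather than on rendered pixels, every image rendered from the modified set would see the same transformed object, and the sampled `R` and `t` are available as ground-truth supervision for a pose estimator.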