🤖 AI Summary
This work addresses the challenging problem of 6D pose estimation for arbitrary objects under three key constraints: absence of 3D models, unknown scale from single-view RGB input, and domain shift between generated and real data. We propose a coarse-to-fine joint scale-pose alignment framework coupled with a text-guided generative domain randomization strategy. Our method integrates multi-view feature matching, differentiable rendering-based optimization, and text-driven generative adversarial reconstruction, followed by fine-tuning on synthetic data to enhance generalization. Key contributions include: (1) the first formulation treating scale as an end-to-end optimizable variable within single-image 6D pose estimation; and (2) leveraging textual priors to guide domain randomization in the generative process, effectively mitigating domain gaps in both reconstruction and pose estimation. Our approach achieves state-of-the-art performance on YCBInEOAT, Toyota-Light, and LM-O benchmarks and demonstrates practical efficacy through successful deployment in dexterous robotic grasping tasks on physical hardware.
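The render-and-compare idea in the summary can be sketched in miniature: project a set of model points with a pinhole camera and descend the reprojection loss from a coarse initialization. Everything below (the point cloud, camera model, step sizes) is an illustrative stand-in, not the paper's differentiable renderer, and rotation and scale are held fixed for brevity (the actual method optimizes them too):

```python
import random

random.seed(0)
# hypothetical unit-scale model points (stand-in for the generated 3D model)
X = [(random.uniform(-0.5, 0.5),
      random.uniform(-0.5, 0.5),
      random.uniform(-0.5, 0.5)) for _ in range(40)]

def project(t):
    """Pinhole projection of the model placed at translation t = (tx, ty, tz)."""
    tx, ty, tz = t
    return [((x + tx) / (z + tz), (y + ty) / (z + tz)) for x, y, z in X]

def loss(t, obs):
    """Mean squared reprojection error -- the 'compare' in render-and-compare."""
    return sum((u - uo) ** 2 + (v - vo) ** 2
               for (u, v), (uo, vo) in zip(project(t), obs)) / len(X)

obs = project((0.1, -0.2, 4.0))       # "observed" features from the true pose
t = [0.0, 0.0, 3.6]                   # coarse initialization
init_loss = loss(t, obs)

lr, eps = 5.0, 1e-6
for _ in range(5000):                 # fine stage: descend the image loss
    base = loss(t, obs)
    grad = [(loss([t[j] + (eps if j == i else 0.0) for j in range(3)], obs)
             - base) / eps for i in range(3)]   # finite differences, no autodiff
    t = [ti - lr * gi for ti, gi in zip(t, grad)]

print(f"loss {init_loss:.2e} -> {loss(t, obs):.2e}")
```

The coarse stage (here, just a rough initial guess) puts the optimizer in the basin of attraction; the fine stage then drives the image-space residual down by gradient descent.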
📝 Abstract
Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robots operating in the long tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance, far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: https://gzwsama.github.io/OnePoseviaGen.github.io/
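The abstract's claim that single-view reconstructions lack metric scale can be made concrete with a toy projection model: scaling the object and its translation by the same factor leaves a single view's reprojections unchanged, while a second view with a known baseline (a rough stand-in for the paper's multi-view feature matching) breaks the ambiguity. All coordinates and camera parameters below are invented for illustration:

```python
import random

random.seed(0)
# hypothetical unit-scale model points, as a single-view 3D generator might output
X = [(random.uniform(-0.5, 0.5),
      random.uniform(-0.5, 0.5),
      random.uniform(-0.5, 0.5)) for _ in range(40)]

TRUE = (2.0, 0.1, -0.2, 4.0)                 # (scale, tx, ty, tz) -- invented

def project(params, cam_x=0.0):
    """Pinhole projection; cam_x shifts the camera along x (a second view)."""
    s, tx, ty, tz = params
    return [((s * x + tx - cam_x) / (s * z + tz),
             (s * y + ty) / (s * z + tz)) for x, y, z in X]

def reproj_err(params, obs, cam_x=0.0):
    """Mean squared 2D error against observed features."""
    return sum((u - uo) ** 2 + (v - vo) ** 2
               for (u, v), (uo, vo) in zip(project(params, cam_x), obs)) / len(X)

obs1 = project(TRUE)                          # reference view
obs2 = project(TRUE, cam_x=1.0)               # auxiliary view, known baseline

fake = tuple(1.5 * p for p in TRUE)           # scale the whole hypothesis by 1.5
e1 = reproj_err(fake, obs1)                   # single view cannot tell them apart
e2 = reproj_err(fake, obs2, cam_x=1.0)        # the second view can
print(f"single-view error {e1:.2e}, two-view error {e2:.2e}")
```

The scaled hypothesis reprojects identically in the reference view (error at floating-point noise level) but produces a clearly nonzero residual in the second view, which is why scale only becomes a recoverable, optimizable variable once multi-view constraints enter the alignment.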