One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging problem of 6D pose estimation for arbitrary objects under three key constraints: absence of 3D models, unknown scale from single-view RGB input, and domain shift between generated and real data. We propose a coarse-to-fine joint scale-pose alignment framework coupled with a text-guided generative domain randomization strategy. Our method integrates multi-view feature matching, differentiable rendering-based optimization, and text-driven generative adversarial reconstruction, followed by fine-tuning on synthetic data to enhance generalization. Key contributions include: (1) the first formulation treating scale as an end-to-end optimizable variable within single-image 6D pose estimation; and (2) leveraging textual priors to guide domain randomization in the generative process, effectively mitigating domain gaps in both reconstruction and pose estimation. Our approach achieves state-of-the-art performance on YCBInEOAT, Toyota-Light, and LM-O benchmarks and demonstrates practical efficacy through successful deployment in dexterous robotic grasping tasks on physical hardware.

📝 Abstract
Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: https://gzwsama.github.io/OnePoseviaGen.github.io/
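The abstract's coarse-to-fine joint scale-pose idea can be illustrated with a toy numpy sketch. This is not the paper's actual pipeline: rotation is held fixed, and a known depth anchor is assumed to break the single-view scale-depth ambiguity (projection is invariant to scaling the model and translation together). The sketch searches a coarse grid over scale, solves the in-plane translation in closed form at each candidate, then refines the grid around the best scale. All intrinsics, point counts, and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
fx = fy = 500.0
cx, cy = 320.0, 240.0                            # pinhole intrinsics (illustrative)
model = rng.uniform(-1.0, 1.0, (80, 3))          # unit-scale model points
s_true = 0.37                                    # unknown metric scale
t_true = np.array([0.10, -0.05, 2.00])           # unknown translation

def project(s, tx, ty, tz):
    """Project scaled, translated model points with a pinhole camera."""
    X = s * model + np.array([tx, ty, tz])
    return np.stack([cx + fx * X[:, 0] / X[:, 2],
                     cy + fy * X[:, 1] / X[:, 2]], axis=1)

obs = project(s_true, *t_true)                   # "observed" 2D feature matches
tz = t_true[2]                                   # depth anchor, assumed known here

def fit_xy_and_loss(s):
    """For a candidate scale, tx and ty enter the projection linearly
    (as f * t / depth), so they have a closed-form least-squares fit."""
    denom = s * model[:, 2] + tz
    ru = obs[:, 0] - (cx + fx * s * model[:, 0] / denom)
    rv = obs[:, 1] - (cy + fy * s * model[:, 1] / denom)
    au, av = fx / denom, fy / denom
    tx = (au @ ru) / (au @ au)
    ty = (av @ rv) / (av @ av)
    return np.mean((project(s, tx, ty, tz) - obs) ** 2), tx, ty

# Coarse-to-fine 1-D search over scale: shrink the grid around the best hit.
lo, hi = 0.05, 1.5
for _ in range(4):
    grid = np.linspace(lo, hi, 50)
    losses = [fit_xy_and_loss(s)[0] for s in grid]
    best = float(grid[int(np.argmin(losses))])
    half = (hi - lo) / 49
    lo, hi = best - half, best + half

s_hat = best
loss_hat, tx_hat, ty_hat = fit_xy_and_loss(s_hat)
```

In the paper the refinement is render-and-compare over full 6D pose with a differentiable renderer; this sketch only captures the structure of the search (coarse scale hypotheses, cheap inner alignment, progressive refinement).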
Problem

Research questions and friction points this paper is trying to address.

Estimating the 6D pose of unseen objects from a single reference image
Bridging domain gaps between synthetic and real data
Enabling one-shot pose estimation without 3D models
Innovation

Methods, ideas, or system contributions that make the work stand out.

A coarse-to-fine alignment module jointly refines scale and pose
Text-guided generative domain randomization diversifies textures
High-fidelity single-view 3D generation supports one-shot pose estimation
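The text-guided domain randomization above can be caricatured as a data-generation loop: sample texture prompts, synthesize one texture per prompt, and re-texture the generated model before rendering pose-labeled training views. In this sketch a hash-seeded noise generator stands in for the paper's actual text-conditioned generative model; all prompts and function names are hypothetical.

```python
import hashlib
import numpy as np

# Illustrative texture prompts; a real pipeline would generate or curate these.
PROMPTS = [
    "weathered red plastic, fine scratches",
    "matte blue metal, light dust",
    "glossy white ceramic, smudged",
]

def text_to_texture(prompt: str, size: int = 64) -> np.ndarray:
    """Stand-in for a text-conditioned texture generator: deterministic
    per prompt via a hash-derived seed, so variants are reproducible."""
    seed = int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % (2 ** 32)
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.0, 1.0, 3)                # dominant color per prompt
    noise = rng.normal(0.0, 0.08, (size, size, 3)) # surface variation
    return np.clip(base + noise, 0.0, 1.0)         # (H, W, 3) texture map

def build_randomized_textures():
    """One texture variant per prompt; each would re-texture the generated
    3D model before rendering synthetic pose-labeled training images."""
    return {p: text_to_texture(p) for p in PROMPTS}

textures = build_randomized_textures()
```

The point of the text guidance is that prompts steer the variation toward plausible appearance changes (material, wear, lighting), rather than the unconstrained color jitter of classic domain randomization.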
Authors

Zheng Geng, Beijing Academy of Artificial Intelligence (BAAI)
Nan Wang, Beijing Academy of Artificial Intelligence (BAAI)
Shaocong Xu, Xiamen University (open-set perception, vision-language perception, diffusion-based perception, machine learning)
Chongjie Ye, The Chinese University of Hong Kong, Shenzhen (computer vision)
Bohan Li, Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo
Zhaoxi Chen, Ph.D. student, Nanyang Technological University (neural rendering, generative models)
Sida Peng, Zhejiang University (computer vision, computer graphics)
Hao Zhao, Beijing Academy of Artificial Intelligence (BAAI); Institute for AI Industry Research (AIR), Tsinghua University; Tsinghua University