One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging problem of 6D pose estimation for arbitrary objects under three key constraints: absence of 3D models, unknown scale from single-view RGB input, and domain shift between generated and real data. We propose a coarse-to-fine joint scale-pose alignment framework coupled with a text-guided generative domain randomization strategy. Our method integrates multi-view feature matching, differentiable rendering-based optimization, and text-driven generative adversarial reconstruction, followed by fine-tuning on synthetic data to enhance generalization. Key contributions include: (1) the first formulation treating scale as an end-to-end optimizable variable within single-image 6D pose estimation; and (2) leveraging textual priors to guide domain randomization in the generative process, effectively mitigating domain gaps in both reconstruction and pose estimation. Our approach achieves state-of-the-art performance on YCBInEOAT, Toyota-Light, and LM-O benchmarks and demonstrates practical efficacy through successful deployment in dexterous robotic grasping tasks on physical hardware.

📝 Abstract
Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: https://gzwsama.github.io/OnePoseviaGen.github.io/
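The abstract's coarse-to-fine joint scale-pose idea can be illustrated with a toy numpy sketch. This is not the paper's actual pipeline: rotation is held fixed, and a known depth anchor is assumed to break the single-view scale-depth ambiguity (projection is invariant to scaling the model and translation together). The sketch searches a coarse grid over scale, solves the in-plane translation in closed form at each candidate, then refines the grid around the best scale. All intrinsics, point counts, and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
fx = fy = 500.0
cx, cy = 320.0, 240.0                            # pinhole intrinsics (illustrative)
model = rng.uniform(-1.0, 1.0, (80, 3))          # unit-scale model points
s_true = 0.37                                    # unknown metric scale
t_true = np.array([0.10, -0.05, 2.00])           # unknown translation

def project(s, tx, ty, tz):
    """Project scaled, translated model points with a pinhole camera."""
    X = s * model + np.array([tx, ty, tz])
    return np.stack([cx + fx * X[:, 0] / X[:, 2],
                     cy + fy * X[:, 1] / X[:, 2]], axis=1)

obs = project(s_true, *t_true)                   # "observed" 2D feature matches
tz = t_true[2]                                   # depth anchor, assumed known here

def fit_xy_and_loss(s):
    """For a candidate scale, tx and ty enter the projection linearly
    (as f * t / depth), so they have a closed-form least-squares fit."""
    denom = s * model[:, 2] + tz
    ru = obs[:, 0] - (cx + fx * s * model[:, 0] / denom)
    rv = obs[:, 1] - (cy + fy * s * model[:, 1] / denom)
    au, av = fx / denom, fy / denom
    tx = (au @ ru) / (au @ au)
    ty = (av @ rv) / (av @ av)
    return np.mean((project(s, tx, ty, tz) - obs) ** 2), tx, ty

# Coarse-to-fine 1-D search over scale: shrink the grid around the best hit.
lo, hi = 0.05, 1.5
for _ in range(4):
    grid = np.linspace(lo, hi, 50)
    losses = [fit_xy_and_loss(s)[0] for s in grid]
    best = float(grid[int(np.argmin(losses))])
    half = (hi - lo) / 49
    lo, hi = best - half, best + half

s_hat = best
loss_hat, tx_hat, ty_hat = fit_xy_and_loss(s_hat)
```

In the paper the refinement is render-and-compare over full 6D pose with a differentiable renderer; this sketch only captures the structure of the search (coarse scale hypotheses, cheap inner alignment, progressive refinement).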
Problem

Research questions and friction points this paper is trying to address.

Estimating the 6D pose of unseen objects from a single reference image
Bridging domain gaps between synthetic and real data
Enabling one-shot pose estimation without 3D models
Innovation

Methods, ideas, or system contributions that make the work stand out.

A coarse-to-fine alignment module jointly refines scale and pose
Text-guided generative domain randomization diversifies textures
High-fidelity single-view 3D generation supports one-shot pose estimation
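The text-guided domain randomization above can be caricatured as a data-generation loop: sample texture prompts, synthesize one texture per prompt, and re-texture the generated model before rendering pose-labeled training views. In this sketch a hash-seeded noise generator stands in for the paper's actual text-conditioned generative model; all prompts and function names are hypothetical.

```python
import hashlib
import numpy as np

# Illustrative texture prompts; a real pipeline would generate or curate these.
PROMPTS = [
    "weathered red plastic, fine scratches",
    "matte blue metal, light dust",
    "glossy white ceramic, smudged",
]

def text_to_texture(prompt: str, size: int = 64) -> np.ndarray:
    """Stand-in for a text-conditioned texture generator: deterministic
    per prompt via a hash-derived seed, so variants are reproducible."""
    seed = int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % (2 ** 32)
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.0, 1.0, 3)                # dominant color per prompt
    noise = rng.normal(0.0, 0.08, (size, size, 3)) # surface variation
    return np.clip(base + noise, 0.0, 1.0)         # (H, W, 3) texture map

def build_randomized_textures():
    """One texture variant per prompt; each would re-texture the generated
    3D model before rendering synthetic pose-labeled training images."""
    return {p: text_to_texture(p) for p in PROMPTS}

textures = build_randomized_textures()
```

The point of the text guidance is that prompts steer the variation toward plausible appearance changes (material, wear, lighting), rather than the unconstrained color jitter of classic domain randomization.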
Authors

Zheng Geng, Beijing Academy of Artificial Intelligence (BAAI)
Nan Wang, Beijing Academy of Artificial Intelligence (BAAI)
Shaocong Xu, Xiamen University (open-set perception, vision-language perception, diffusion-based perception, machine learning)
Chongjie Ye, The Chinese University of Hong Kong, Shenzhen (computer vision)
Bohan Li, Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo
Zhaoxi Chen, Ph.D. student, Nanyang Technological University (neural rendering, generative models)
Sida Peng, Zhejiang University (computer vision, computer graphics)
Hao Zhao, Beijing Academy of Artificial Intelligence (BAAI); Institute for AI Industry Research (AIR), Tsinghua University; Tsinghua University