🤖 AI Summary
This work addresses the challenging problem of 6D pose estimation for arbitrary objects from a single RGB-D reference image, under zero-shot, CAD-free, and multi-view-free conditions. The proposed framework, One2Any, introduces a Reference Object Pose Embedding (ROPE) encoder that implicitly encodes a single RGB-D reference view as a generalizable pose prior. A U-Net-based decoding module then performs end-to-end RGB-D feature alignment and relative pose decoding. Crucially, the method requires no 3D CAD models, category-level annotations, or multi-view supervision, which supports cross-object generalization and scalable pairwise training. Evaluated on multiple standard benchmarks, it achieves state-of-the-art accuracy and robustness, comparable to methods that rely on CAD models or multi-view inputs, at significantly lower computational cost.
📝 Abstract
6D object pose estimation remains challenging for many applications due to its dependence on complete 3D models, multi-view images, or training restricted to specific object categories. These requirements make it difficult to generalize to novel objects for which neither 3D models nor multi-view images are available. To address this, we propose One2Any, a novel method that estimates the relative 6-degrees-of-freedom (DOF) object pose from only a single reference and a single query RGB-D image, without prior knowledge of the object's 3D model, multi-view data, or category constraints. We treat object pose estimation as an encoding-decoding process: first, we obtain a comprehensive Reference Object Pose Embedding (ROPE) that encodes the object's shape, orientation, and texture from a single reference view. Using this embedding, a U-Net-based pose decoding module produces Reference Object Coordinate (ROC) maps for new views, enabling fast and accurate pose estimation. This simple encoding-decoding framework allows our model to be trained on any pairwise pose data, enabling large-scale training and demonstrating great scalability. Experiments on multiple benchmark datasets show that our model generalizes well to novel objects, achieving state-of-the-art accuracy and robustness, rivaling even methods that require multi-view or CAD inputs, at a fraction of the compute.
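To make the decoding step concrete: an ROC map pairs query pixels with 3D coordinates expressed in the reference object's frame, and together with the query depth this yields 3D-3D correspondences. One standard way to recover a rigid pose from such correspondences is a least-squares (Kabsch/Umeyama-style) alignment. The sketch below is purely illustrative and not the paper's exact solver; the function name and shapes are assumptions.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rigid alignment (Kabsch): find R, t with dst ≈ src @ R.T + t.

    src: (N, 3) points in the reference object frame (e.g. from an ROC map).
    dst: (N, 3) corresponding points in the query camera frame (from depth).
    """
    # Center both point sets on their centroids.
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered correspondences.
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection: force det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t
```

In practice such an alignment is typically wrapped in RANSAC to reject outlier correspondences from imperfect ROC predictions or noisy depth.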