AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces a novel paradigm for single-view, model-free, correspondence-free 6D object pose estimation, eliminating reliance on CAD models, multi-stage regression, 2D-3D feature matching, and geometric priors or procedures such as depth maps, SfM, and PnP. Methodologically, it proposes the first Axis Generation framework: a diffusion model explicitly learns the joint distribution of an object's three orthogonal axes, and an Axis Generation Module (AGM) and a Triaxial Back-projection Module (TBM) jointly recover the pose via geometric-consistency-aware gradient injection into the noise prediction. Crucially, the method operates on a single RGB image alone, requiring no reference images or appearance priors. Evaluated across multiple benchmarks, it demonstrates strong cross-instance generalization, significantly improving deployment efficiency and robustness, and establishes a scalable, open-world solution for 6D pose estimation.
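The "geometric-consistency-aware gradient injection" described above can be illustrated with a minimal sketch in the spirit of classifier guidance: at each denoising step, the gradient of a consistency loss on the current tri-axis estimate is added to the predicted noise, steering generation toward orthogonal axes. The loss below (orthonormality of the 3x3 axis matrix) and the guidance scale are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def geometric_consistency_loss(axes):
    """Toy consistency loss: deviation of the 3x3 axis matrix from
    orthonormality (a stand-in for the paper's geometric consistency loss)."""
    gram = axes @ axes.T
    return float(np.sum((gram - np.eye(3)) ** 2))

def guided_noise(eps_pred, axes, scale=0.1, h=1e-5):
    """Inject the gradient of the consistency loss into the predicted noise,
    classifier-guidance style. A finite-difference gradient is used here
    purely for self-containment (a real model would backpropagate)."""
    grad = np.zeros_like(axes)
    base = geometric_consistency_loss(axes)
    for i in range(3):
        for j in range(3):
            pert = axes.copy()
            pert[i, j] += h
            grad[i, j] = (geometric_consistency_loss(pert) - base) / h
    return eps_pred + scale * grad
```

For an already-orthonormal axis matrix the correction vanishes, so guidance only activates when the generated tri-axis drifts from a valid rotation frame.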

📝 Abstract
Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex inputs (i.e., multi-view references, depth, or CAD models) and intricate pipelines (i.e., feature extraction, SfM, 2D-3D matching, and PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching via geometric techniques such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects, without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of a geometric consistency loss into the noise estimate, maintaining the geometric consistency of the generated tri-axis. From the generated tri-axis projection, AxisPose then adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose. AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input, without reference images, and shows great potential for generalization to the unseen-object level.
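The final back-projection step (recovering rotation from the generated tri-axis) can be sketched generically: a generated 3x3 axis matrix will not be exactly orthogonal, and the nearest proper rotation can be obtained via SVD (orthogonal Procrustes projection onto SO(3)). This is a standard construction assumed here for illustration, not the paper's exact TBM procedure, which also involves back-projecting the 2D axis projections.

```python
import numpy as np

def rotation_from_triaxis(axes):
    """Project a possibly noisy, non-orthogonal 3x3 axis matrix onto SO(3).
    Columns of `axes` are the estimated x/y/z axis directions; the result is
    the nearest rotation matrix in the Frobenius-norm sense."""
    U, _, Vt = np.linalg.svd(axes)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        # Flip one singular direction to ensure a proper rotation (det = +1).
        U[:, -1] *= -1
        R = U @ Vt
    return R
```

Given the recovered rotation, translation would follow from the projected axis origin and camera intrinsics, completing the 6D pose.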
Problem

Research questions and friction points this paper is trying to address.

Single-shot 6D object pose estimation without multi-stage regression.
Eliminates reliance on 2D-3D feature matching and complex input pipelines.
Generalizes to unseen objects using a single view without reference images.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-shot 6D pose estimation without reference views
Diffusion model learns latent axis distribution
Triaxial back-projection recovers 6D pose
👥 Authors
Yang Zou, Northwestern Polytechnical University
Zhaoshuai Qi, Northwestern Polytechnical University (3D computer vision, pose estimation)
Yating Liu, Northwestern Polytechnical University
Zihao Xu, Dalian University of Technology
Weipeng Sun, Dalian University of Technology
Weiyi Liu, Purdue University (machine learning)
Xingyuan Li, Dalian University of Technology
Jiaqi Yang, Northwestern Polytechnical University
Yanning Zhang, Northwestern Polytechnical University (computer vision)