🤖 AI Summary
This paper addresses category-level 9-degree-of-freedom (9D) pose estimation of unseen objects from a single RGB image. We propose the first end-to-end, one-stage query-based framework for this task. Our core innovation lies in formulating 9D pose estimation as a natural extension of 2D detection: we design a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching loss—enabling unified regression without pseudo-depth, CAD models, or multi-stage cascades. Built upon a Transformer-based detector, our method requires only RGB input and category labels during training. Evaluated on three major benchmarks including REAL275, it achieves new state-of-the-art performance: 79.6% IoU₅₀ and 54.1% accuracy under the 10°/10 cm metric. Our approach significantly outperforms existing pure-RGB methods and narrows the gap to RGB-D-based systems.
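For intuition, a "6D-aware" Hungarian matching cost of the kind mentioned above can be sketched as a standard optimal-assignment step whose cost mixes a 2D box term with a rotation term. The cost forms, weights, and function names below are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch (not the paper's code): Hungarian matching between
# predicted queries and ground-truth objects, with a cost that combines a
# 2D box distance and a geodesic rotation distance. Weights are assumed.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, pred_rot, gt_boxes, gt_rot, w_box=1.0, w_pose=1.0):
    """Match N predicted queries to M ground-truth objects one-to-one."""
    # L1 distance between (cx, cy, w, h) box vectors, shape (N, M).
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    # Geodesic angle between 3x3 rotation matrices, shape (N, M):
    # angle = arccos((trace(R_pred @ R_gt^T) - 1) / 2).
    rel = np.einsum('nij,mkj->nmik', pred_rot, gt_rot)  # R_pred @ R_gt^T
    trace = np.trace(rel, axis1=2, axis2=3)
    pose_cost = np.arccos(np.clip((trace - 1.0) / 2.0, -1.0, 1.0))
    cost = w_box * box_cost + w_pose * pose_cost
    row, col = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(row, col))
```

The point of folding the rotation term into the matching cost is that a query with a slightly worse box but a much better pose can still win the assignment, so the pose head receives gradients from the most pose-consistent match.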
📝 Abstract
Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end with only RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project page.
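As a rough illustration of how a bounding-box-conditioned translation module can work in an RGB-only setting, 3D translation is commonly recovered by predicting a depth value and a pixel offset of the projected object center relative to the detected box center, then back-projecting through the camera intrinsics. The sketch below assumes this standard pinhole parameterization; the names and interface are ours, not YOPO's:

```python
# Hedged sketch (not YOPO's actual head): recover a 3D translation from a
# 2D box center, a predicted pixel offset, a predicted depth z, and the
# pinhole camera intrinsics (fx, fy, cx, cy).
import numpy as np

def recover_translation(box_center, offset_uv, z, fx, fy, cx, cy):
    # Projected 3D object center in pixels: box center plus learned offset.
    u = box_center[0] + offset_uv[0]
    v = box_center[1] + offset_uv[1]
    # Back-project through the pinhole model to metric camera coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

Conditioning on the box center means the network only has to regress a small residual offset plus a depth, which is generally easier to learn than an absolute translation in camera space.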