🤖 AI Summary
This paper addresses category-level 9-degree-of-freedom (9D) pose estimation of unseen objects from a single RGB image. We propose the first end-to-end, one-stage query-based framework for this task. Our core innovation lies in formulating 9D pose estimation as a natural extension of 2D detection: we design a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching loss—enabling unified regression without pseudo-depth, CAD models, or multi-stage cascades. Built upon a Transformer-based detector, our method requires only RGB input and category labels during training. Evaluated on three major benchmarks including REAL275, it achieves new state-of-the-art performance: 79.6% IoU₅₀ and 54.1% accuracy under the 10°/10 cm metric. Our approach significantly outperforms existing pure-RGB methods and narrows the gap to RGB-D-based systems.
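For intuition, a "6D-aware" Hungarian matching cost of the kind mentioned above can be sketched as a standard optimal-assignment step whose cost mixes a 2D box term with a rotation term. The cost forms, weights, and function names below are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch (not the paper's code): Hungarian matching between
# predicted queries and ground-truth objects, with a cost that combines a
# 2D box distance and a geodesic rotation distance. Weights are assumed.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, pred_rot, gt_boxes, gt_rot, w_box=1.0, w_pose=1.0):
    """Match N predicted queries to M ground-truth objects one-to-one."""
    # L1 distance between (cx, cy, w, h) box vectors, shape (N, M).
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    # Geodesic angle between 3x3 rotation matrices, shape (N, M):
    # angle = arccos((trace(R_pred @ R_gt^T) - 1) / 2).
    rel = np.einsum('nij,mkj->nmik', pred_rot, gt_rot)  # R_pred @ R_gt^T
    trace = np.trace(rel, axis1=2, axis2=3)
    pose_cost = np.arccos(np.clip((trace - 1.0) / 2.0, -1.0, 1.0))
    cost = w_box * box_cost + w_pose * pose_cost
    row, col = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(row, col))
```

The point of folding the rotation term into the matching cost is that a query with a slightly worse box but a much better pose can still win the assignment, so the pose head receives gradients from the most pose-consistent match.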
📝 Abstract
Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end with only RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project page.
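As a rough illustration of how a bounding-box-conditioned translation module can work in an RGB-only setting, 3D translation is commonly recovered by predicting a depth value and a pixel offset of the projected object center relative to the detected box center, then back-projecting through the camera intrinsics. The sketch below assumes this standard pinhole parameterization; the names and interface are ours, not YOPO's:

```python
# Hedged sketch (not YOPO's actual head): recover a 3D translation from a
# 2D box center, a predicted pixel offset, a predicted depth z, and the
# pinhole camera intrinsics (fx, fy, cx, cy).
import numpy as np

def recover_translation(box_center, offset_uv, z, fx, fy, cx, cy):
    # Projected 3D object center in pixels: box center plus learned offset.
    u = box_center[0] + offset_uv[0]
    v = box_center[1] + offset_uv[1]
    # Back-project through the pinhole model to metric camera coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

Conditioning on the box center means the network only has to regress a small residual offset plus a depth, which is generally easier to learn than an absolute translation in camera space.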