🤖 AI Summary
This paper addresses category-level 6D pose estimation and detection from a single RGB image. The authors propose the first end-to-end unified framework that requires neither RGB-D input nor a two-stage pipeline. Methodologically, detection and pose estimation are modeled jointly within a single network: (i) a differentiable neural mesh representation for a 3D prototype library; (ii) feature-alignment-driven differentiable rendering; and (iii) a multi-model RANSAC optimizer that enables cross-task collaborative optimization. By eliminating the reliance on depth data and avoiding error propagation from post-processing, the approach significantly improves robustness and generalization. On the REAL275 benchmark, the method outperforms prior art by 22.9% averaged across all scale-agnostic metrics, establishing a new state of the art for category-level 6D pose estimation from monocular RGB images.
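The multi-model RANSAC idea in (iii), proposing pose hypotheses from several category prototypes and keeping the one that best explains the observed features, can be sketched in a toy 2D setting. Everything below (2D similarity transforms standing in for full 6D poses, the function names, the inlier threshold) is illustrative and not taken from the paper's actual implementation.

```python
import numpy as np

def fit_similarity(src, dst):
    """Estimate a 2D similarity transform (scale, rotation, translation)
    from two point correspondences, via complex numbers: dst = m*src + c."""
    a = src[:, 0] + 1j * src[:, 1]
    b = dst[:, 0] + 1j * dst[:, 1]
    if abs(a[1] - a[0]) < 1e-12:  # degenerate sample
        return None
    m = (b[1] - b[0]) / (a[1] - a[0])
    c = b[0] - m * a[0]
    return m, c

def project(m, c, pts):
    """Apply the similarity transform to an (N, 2) point array."""
    z = pts[:, 0] + 1j * pts[:, 1]
    w = m * z + c
    return np.stack([w.real, w.imag], axis=1)

def multi_model_ransac(prototypes, detections, iters=100, thresh=0.1, seed=0):
    """Jointly select the prototype (detection) and its transform (pose)
    with the largest inlier count; correspondences are index-aligned."""
    rng = np.random.default_rng(seed)
    best = (None, None, -1)  # (prototype index, transform, inlier count)
    for k, proto in enumerate(prototypes):
        n = min(len(proto), len(detections))
        for _ in range(iters):
            idx = rng.choice(n, size=2, replace=False)
            fit = fit_similarity(proto[idx], detections[idx])
            if fit is None:
                continue
            resid = np.linalg.norm(project(*fit, proto[:n]) - detections[:n], axis=1)
            inliers = int((resid < thresh).sum())
            if inliers > best[2]:
                best = (k, fit, inliers)
    return best
```

Selecting the prototype with the most inliers couples detection (which prototype is present) with pose estimation (the transform itself), mirroring, at toy scale, the cross-task optimization described above.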
📝 Abstract
Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the previous state of the art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness than single-stage baselines. Our code and models are available at https://github.com/Fischer-Tom/unified-detection-and-pose-estimation.