🤖 AI Summary
This work proposes a single-stage, end-to-end framework based on YOLO to address the high latency of multi-stage methods for 6D object pose estimation from monocular RGB images. The approach introduces an auxiliary keypoint head to regress 2D projections of 3D bounding box corners and employs a continuous 9D representation combined with singular value decomposition (SVD) to enable stable and differentiable rotation regression. By integrating a keypoint enhancement mechanism and the 9D→SO(3) rotation representation into a single-stage detector, the method achieves a favorable balance between accuracy and efficiency. Evaluated on the LINEMOD and LINEMOD-Occluded datasets, it attains ADD(-S) 0.1d accuracies of 96.24% and 69.41%, respectively, while meeting real-time performance requirements.
📝 Abstract
Estimating the 6D pose of objects from a single RGB image is a critical task for robotics and extended reality applications. However, state-of-the-art multi-stage methods often suffer from high latency, making them unsuitable for real-time use. In this paper, we present YOLO-Key-6D, a novel single-stage, end-to-end framework for monocular 6D pose estimation designed for both speed and accuracy. Our approach enhances a YOLO-based architecture by integrating an auxiliary head that regresses the 2D projections of an object's 3D bounding box corners. This keypoint detection task significantly improves the network's understanding of 3D geometry. For stable end-to-end training, we directly regress rotation using a continuous 9D representation projected to SO(3) via singular value decomposition. On the LINEMOD and LINEMOD-Occluded benchmarks, YOLO-Key-6D achieves competitive ADD(-S) 0.1d accuracies of 96.24% and 69.41%, respectively, while running in real time. Our results demonstrate that a carefully designed single-stage method can provide a practical and effective balance of performance and efficiency for real-world deployment.
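The 9D→SO(3) step the abstract describes is a standard orthogonalization: the network emits nine unconstrained values, which are reshaped into a 3×3 matrix and projected onto the nearest rotation matrix (in the Frobenius-norm sense) via SVD. The sketch below shows this projection in isolation; the function name and input shape are illustrative, not taken from the paper's code.

```python
import numpy as np

def svd_project_to_so3(m9: np.ndarray) -> np.ndarray:
    """Project a raw 9D network output onto SO(3) via SVD.

    A generic sketch of the SVD orthogonalization described in the
    abstract; the paper's exact head design may differ. The input is
    a flat 9-vector, the output a proper rotation matrix.
    """
    M = m9.reshape(3, 3)
    U, _, Vt = np.linalg.svd(M)
    # Flip the sign of the last singular direction if needed so that
    # det(R) = +1 (a rotation, not a reflection).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return R
```

Because this projection is differentiable almost everywhere, gradients can flow through it during training, which is what makes the 9D representation suitable for end-to-end rotation regression (unlike discontinuous parameterizations such as Euler angles or quaternion hemispheres).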