🤖 AI Summary
To address the limited robustness of 6D object pose estimation under occlusion and novel viewpoints, this paper proposes a retrieval-augmented multimodal pose estimation framework. Our method constructs a CAD knowledge base integrating multi-view rendered images and 3D point clouds, and introduces the ReSPC cross-modal matching module to achieve geometrically consistent alignment and pose refinement between query images and CAD models. It jointly leverages visual semantics and geometric priors through multimodal feature extraction, rendering-based 2D–3D alignment, retrieval-augmented decoding, and efficient CAD model retrieval. Evaluated on standard benchmarks—including LINEMOD and OCCLUSION—as well as real-world robotic grasping tasks, our approach achieves significant improvements in pose accuracy (average +8.2% ADD-S) and robustness under occlusion and viewpoint variation. This work establishes a generalizable paradigm for pose perception in robotic manipulation.
📝 Abstract
Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose roughly contains three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD rendered images and also attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose .