OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurate 6D object pose estimation in open-set, dynamically changing environments where precise CAD models are often unavailable. The authors propose a training-free method for open-set CAD model retrieval that, given only a single RGB image and a textual prompt, retrieves matching models from an unlabeled 3D database. The approach leverages multi-view rendering, image captioning, and multimodal embeddings within a two-stage pipeline: coarse filtering with CLIP followed by fine-grained reranking with DINOv2, with object detection handled by GroundedSAM. This is the first method to enable training-free, open-set CAD retrieval with dynamic model library expansion, directly supporting downstream 6D pose estimation. Evaluated on the MI3DOR benchmark, it outperforms existing approaches and achieves 90.48% average retrieval precision on YCB-Video, integrating successfully into Megapose with better performance than reconstruction-based alternatives.

📝 Abstract
6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support ever-changing object sets in such contexts, modern zero-shot object pose estimators have been developed that require no object-specific training and rely only on CAD models. Such models are hard to obtain once a system is deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce OSCAR (Open-Set CAD Retrieval from a Language Prompt and a Single Image), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the region of interest and the database captions. OSCAR employs two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments, we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability in automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation when the exact instance is not available and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V dataset. Moreover, we demonstrate that the most similar object model can be used for pose estimation with Megapose, achieving better results than a reconstruction-based approach.
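The two-stage retrieval described in the abstract can be sketched as follows. This is a minimal illustration only, assuming precomputed embedding matrices (CLIP-style text embeddings for the database captions and the query prompt, DINOv2-style image embeddings for the model renderings and the detected region of interest); the function names and the top-k parameter are hypothetical, not from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between query vectors a and database vectors b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def two_stage_retrieval(query_text_emb, query_img_emb,
                        caption_embs, render_embs, k=5):
    """Stage 1: text-based filtering (CLIP-style embeddings) keeps the
    top-k candidate models by caption similarity.
    Stage 2: image-based reranking (DINOv2-style embeddings) picks the
    most visually similar candidate. Returns the database index."""
    # Stage 1: coarse filter on text embeddings
    text_scores = cosine_sim(query_text_emb[None, :], caption_embs)[0]
    candidates = np.argsort(-text_scores)[:k]
    # Stage 2: rerank surviving candidates by visual similarity of renderings
    img_scores = cosine_sim(query_img_emb[None, :], render_embs[candidates])[0]
    return candidates[np.argmax(img_scores)]
```

With toy one-hot embeddings, a text prompt matching models 0 and 1 combined with an image embedding matching model 1's rendering retrieves index 1, showing how the visual stage disambiguates among text candidates.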
Problem

Research questions and friction points this paper is trying to address.

6D object pose estimation
open-set CAD retrieval
object model retrieval
zero-shot pose estimation
cross-domain retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-set retrieval
zero-shot 6D pose estimation
multimodal embedding
training-free CAD retrieval
language-guided 3D matching