ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

📅 2025-12-09
🤖 AI Summary
This work addresses a limitation of existing 6DoF object pose estimation methods: their heavy reliance on large-scale, object-specific training data. We propose ConceptPose, a fully training-free and model-free zero-shot approach. Our method leverages a vision-language model (VLM) to construct open-vocabulary 3D concept maps, in which each 3D point is tagged with a concept vector derived from saliency maps, thereby aligning textual semantics with 3D point clouds. Matching these concept vectors across maps yields 3D-3D correspondences for relative pose estimation without any supervision. Crucially, no object modeling, fine-tuning, or annotations are required; only a single RGB image and a natural language description suffice for inference. Evaluated on common zero-shot relative pose estimation benchmarks, our method achieves state-of-the-art performance, improving the ADD(-S) score by over 62% compared to prior approaches, including those trained extensively on dataset-specific data. This breaks the traditional supervised paradigm's dependence on task-specific data and custom model design.

📝 Abstract
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.
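The ADD(-S) score cited above builds on two standard pose-error measures. The sketch below shows how ADD and ADD-S are commonly computed from a set of object points and two poses; it is a generic illustration rather than the paper's evaluation code, and the function names (`transform`, `add_metric`, `add_s_metric`) are hypothetical. A pose is often counted as correct when the error falls below a fraction of the object diameter (10% is a common convention), though the exact threshold used in this work is not stated here.

```python
import math

def transform(R, t, p):
    """Apply a 3x3 rotation R (nested lists) and a translation t to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between each point under the ground-truth pose
    and the same point under the estimated pose."""
    total = 0.0
    for p in points:
        total += math.dist(transform(R_gt, t_gt, p), transform(R_est, t_est, p))
    return total / len(points)

def add_s_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to the closest
    estimated point, which tolerates symmetric objects."""
    estimated = [transform(R_est, t_est, p) for p in points]
    total = 0.0
    for p in points:
        g = transform(R_gt, t_gt, p)
        total += min(math.dist(g, e) for e in estimated)
    return total / len(points)
```

For an object that is symmetric under a 180-degree turn, an estimate that is off by exactly that turn has a large ADD but an ADD-S of zero, which is why the symmetric variant is used for such objects.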
Problem

Research questions and friction points this paper is trying to address.

Estimates 6DoF object pose without training
Uses vision-language models for zero-shot capability
Creates 3D concept maps whose points carry concept vectors derived from saliency maps
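One way to picture a concept vector derived from saliency maps: if a VLM yields one saliency map per named concept, the values at a given pixel can be stacked into a vector for the point seen at that pixel. The sketch below illustrates that idea only. The stacking and L2 normalization, the function name `concept_vectors`, and the example concepts are all assumptions, not details confirmed by the paper.

```python
import math

def concept_vectors(saliency_maps, pixels):
    """Stack per-concept saliency values into one vector per pixel.

    saliency_maps: list of K grids (lists of rows), one per concept (assumed layout).
    pixels: list of (row, col) locations to describe.
    Returns one L2-normalized K-dimensional vector per pixel;
    an all-zero vector is returned unchanged.
    """
    vectors = []
    for row, col in pixels:
        v = [grid[row][col] for grid in saliency_maps]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v] if norm > 0 else v)
    return vectors

# Hypothetical saliency maps for two concepts on a 2x2 image.
handle = [[0.0, 3.0], [0.0, 0.0]]
spout = [[0.0, 4.0], [1.0, 0.0]]
vectors = concept_vectors([handle, spout], [(0, 1), (1, 0)])
```

Normalizing each vector makes later comparisons depend on which concepts respond at a point rather than on overall saliency strength.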
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free object pose estimation using concept vectors
Open-vocabulary 3D concept maps from vision-language models
Robust 3D-3D correspondences for precise 6DoF pose estimation
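To make the correspondence step concrete, the sketch below pairs points from two concept maps by mutual nearest neighbor in cosine similarity. This is a common generic matching scheme and an assumption here, not the authors' confirmed procedure; `mutual_matches` and the `min_sim` threshold are hypothetical. In practice, the matched 3D points would typically be passed to a rigid alignment step (for example a Kabsch fit inside a RANSAC loop) to recover the 6DoF relative pose.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def best_match(query, candidates):
    """Index of the candidate vector most similar to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))

def mutual_matches(vecs_a, vecs_b, min_sim=0.8):
    """Return (i, j) pairs where a[i] and b[j] are each other's best match
    and their similarity is at least min_sim (an assumed threshold)."""
    matches = []
    for i, va in enumerate(vecs_a):
        j = best_match(va, vecs_b)
        if best_match(vecs_b[j], vecs_a) == i and cosine(va, vecs_b[j]) >= min_sim:
            matches.append((i, j))
    return matches

# Hypothetical concept vectors for points in two views of the same object.
view_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 0.5, 0.0)]
view_b = [(0.0, 0.9, 0.1), (0.95, 0.05, 0.0)]
pairs = mutual_matches(view_a, view_b)
```

The mutual check discards ambiguous points such as the third vector in `view_a`, which resembles both candidates but is the best match of neither.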