Accurate and efficient zero-shot 6D pose estimation with frozen foundation models

📅 2025-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses zero-shot 6D pose estimation of unseen objects from RGB-D inputs without task-specific training. The method, FreeZeV2, combines frozen vision and geometric foundation models with RANSAC-based 3D registration. Relative to its predecessor FreeZe, it contributes: (1) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (2) a feature-aware scoring mechanism used both to select poses during registration and to rank the final pose candidates; and (3) a modular design that supports ensembles of instance segmentation models, making the method more robust to segmentation mask errors. On the seven core datasets of the BOP Benchmark, it sets a new state of the art for unseen objects: with the same segmentation masks it is 8× faster and 5% more accurate than FreeZe, and with segmentation ensembles it gains a further 8% in accuracy while remaining 2.5× faster than FreeZe. It was awarded Best Overall Method at the BOP Challenge 2024.

📝 Abstract

Estimating the 6D pose of objects from RGB-D data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To show that it is not, we introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
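To make the idea of feature-aware scoring inside RANSAC-based 3D registration concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so `feature_aware_score` (distance inliers weighted by cosine similarity of their features) is an assumption, as are the function names, the threshold `tau`, and the requirement that features are L2-normalized.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def feature_aware_score(R, t, src_pts, src_feat, dst_pts, dst_feat, tau):
    """Hypothetical score: each source point that lands within tau of its
    nearest target point contributes the cosine similarity of the two features."""
    moved = src_pts @ R.T + t
    d2 = ((moved[:, None, :] - dst_pts[None, :, :]) ** 2).sum(axis=-1)
    nn = d2.argmin(axis=1)
    inlier = d2[np.arange(len(moved)), nn] < tau ** 2
    cos = (src_feat * dst_feat[nn]).sum(axis=-1)  # features assumed L2-normalized
    return float((inlier * np.clip(cos, 0.0, None)).sum() / len(moved))

def ransac_register(src_pts, src_feat, dst_pts, dst_feat, iters=200, tau=0.05, seed=0):
    """Return (score, R, t) of the best-scoring pose hypothesis."""
    rng = np.random.default_rng(seed)
    # Putative correspondences: nearest neighbor in feature space.
    match = (src_feat @ dst_feat.T).argmax(axis=1)
    best = (-1.0, np.eye(3), np.zeros(3))
    for _ in range(iters):
        idx = rng.choice(len(src_pts), size=3, replace=False)
        R, t = kabsch(src_pts[idx], dst_pts[match[idx]])
        score = feature_aware_score(R, t, src_pts, src_feat, dst_pts, dst_feat, tau)
        if score > best[0]:
            best = (score, R, t)
    return best
```

A purely geometric score would count inliers only; weighting each inlier by feature agreement is one plausible way to penalize poses that align the shapes but put mismatched surface regions on top of each other. The same score could also be reused to rank candidates produced from different segmentation masks.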

Problem

Research questions and friction points this paper is trying to address.

Achieving generalization to novel objects in 6D pose estimation

Reducing the computational cost of large-scale, task-specific training on synthetic data

Improving efficiency and accuracy without task-specific training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained geometric and vision models

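Uses sparse feature extraction to reduce inference-time computation without sacrificing accuracy

Applies feature-aware scoring to select poses during RANSAC-based 3D registration and to rank the final pose candidates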
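The summary does not say how the sparse set of points for feature extraction is chosen. As an illustration only, the sketch below uses farthest point sampling, a common way to pick a small, well-spread subset of a point cloud so that expensive features are computed on `k` points instead of all of them. The technique choice and the function name are assumptions, not the paper's method.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices that are mutually far apart.

    points: (N, 3) array; k: number of samples; start: index of the first pick.
    """
    chosen = np.empty(k, dtype=int)
    chosen[0] = start
    # dist[i] = distance from point i to the closest already-chosen point.
    dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(dist.argmax())
        new_dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return chosen
```

Features would then be extracted only at `points[chosen]`, which shrinks both the feature computation and the correspondence search used during registration.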

🔎 Similar Papers

FreeZe: Training-Free Zero-Shot 6D Pose Estimation with Geometric and Vision Foundation Models