🤖 AI Summary
This paper addresses zero-shot 6D pose estimation of unseen objects from RGB-D input, without task-specific training. The method, FreeZeV2, is the second generation of FreeZe. It builds on frozen geometric and vision foundation models and combines sparse feature extraction, a feature-aware scoring mechanism, and a modular design that supports ensembles of instance segmentation models, with RANSAC-based 3D registration for matching. Key contributions are: (1) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (2) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based registration and the final ranking of pose candidates; and (3) a modular design supporting segmentation-model ensembles, which increases robustness to segmentation mask errors. On the seven core datasets of the BOP Benchmark, the approach sets a new state of the art: with the same segmentation masks it is 8× faster and 5% more accurate than FreeZe, and with segmentation ensembles it gains a further 8% in accuracy while remaining 2.5× faster than FreeZe. It was awarded Best Overall Method at the BOP Challenge 2024.
📝 Abstract
Estimating the 6D pose of objects from RGB-D data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? We argue that it is not, and introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state of the art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves an 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
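
To make the idea of feature-aware scoring inside RANSAC-based 3D registration concrete, here is a minimal Python sketch. It is not the FreeZeV2 implementation: the abstract does not specify the scoring formula, so the score used here (the sum of feature similarities over geometric inliers) is an assumption chosen for illustration. The function names `kabsch`, `feature_aware_score`, and `ransac_register`, and the parameters `feat_sim` and `inlier_thresh`, are likewise hypothetical. The sketch assumes putative correspondences `src[i]` and `dst[i]` have already been obtained by matching foundation-model features.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src[i] + t ~= dst[i]."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def feature_aware_score(R, t, src, dst, feat_sim, inlier_thresh):
    """Assumed score: sum of feature similarities over geometric inliers.

    A purely geometric score would count inliers; weighting by feat_sim
    lets correspondences with similar features contribute more.
    """
    residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
    inliers = residuals < inlier_thresh
    return float(np.sum(feat_sim[inliers]))


def ransac_register(src, dst, feat_sim, iters=200, inlier_thresh=0.01, seed=0):
    """Sample 3 correspondences per iteration and keep the best-scoring pose."""
    rng = np.random.default_rng(seed)
    best_score, best_pose = -np.inf, None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        score = feature_aware_score(R, t, src, dst, feat_sim, inlier_thresh)
        if score > best_score:
            best_score, best_pose = score, (R, t)
    return best_pose, best_score
```

In this sketch the same score could also be reused to rank the final pose candidates from several segmentation masks, which mirrors the double role the abstract attributes to the scoring mechanism. How FreeZeV2 actually combines geometric and feature terms is not stated in the abstract.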