🤖 AI Summary
To address the challenges of localizing unseen categories and achieving robust cross-modal matching in zero-shot 2D object detection and segmentation, this paper proposes MUSE, a training-free, model-based framework. MUSE renders multi-view 2D templates from the 3D models of unseen objects and matches them against candidate regions extracted from query images. In the embedding stage, it fuses class-token embeddings with patch embeddings normalized by generalized mean pooling (GeM); in the matching stage, it applies a novel joint similarity metric that integrates both absolute and relative similarities, and it calibrates candidate-region reliability with an uncertainty-aware object prior. Evaluated on the BOP Challenge 2025, MUSE achieves state-of-the-art zero-shot detection and segmentation performance, securing first place across all tracks: Classic Core, H3, and Industrial.
📝 Abstract
In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from the 3D models of unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, normalizing the patch embeddings with generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, improving the robustness of matching in challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE is a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
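The abstract does not give the exact formulas, but the two core operations it names have standard forms that can be sketched. Below is a minimal NumPy illustration under stated assumptions: GeM pooling follows its usual definition (mean of p-th powers, then the 1/p root; p=1 recovers average pooling), while `joint_similarity` is one plausible reading of the absolute/relative combination, where the absolute score is the best cosine similarity to the target object's templates and the relative score normalizes it against the best match over all objects' templates. The blending weight `alpha`, the relative normalization, and the function names are assumptions, not the paper's actual method.

```python
import numpy as np

def gem_pool(patch_embeddings, p=3.0, eps=1e-6):
    """Generalized mean pooling over patch embeddings of shape (N_patches, D).

    Standard GeM: (mean(x^p))^(1/p); p=1 gives average pooling and large p
    approaches max pooling. Activations are clipped to stay positive.
    """
    x = np.clip(patch_embeddings, eps, None)
    return np.mean(x ** p, axis=0) ** (1.0 / p)

def l2_normalize(v, eps=1e-12):
    """Unit-normalize a vector so dot products become cosine similarities."""
    return v / (np.linalg.norm(v) + eps)

def joint_similarity(proposal_emb, target_templates, all_templates, alpha=0.5):
    """Hypothetical joint score (assumed form, not from the paper).

    Absolute term: best cosine similarity of the proposal to the target
    object's templates. Relative term: that score normalized by the best
    similarity over every object's templates, so a proposal that matches
    the target better than any distractor scores near 1.
    """
    s_abs = (target_templates @ proposal_emb).max()
    s_all = (all_templates @ proposal_emb).max()
    s_rel = s_abs / (s_all + 1e-12)
    return alpha * s_abs + (1 - alpha) * s_rel
```

A proposal descriptor would then be built by fusing the class token with the GeM-pooled patches, e.g. `l2_normalize(np.concatenate([cls_token, gem_pool(patches)]))`, before scoring it against each object's template descriptors.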