🤖 AI Summary
To address the challenges of localizing unseen categories and achieving robust cross-modal matching in zero-shot 2D object detection and segmentation, this paper proposes MUSE, a training-free, model-based framework. MUSE renders multi-view 2D templates from the 3D models of unseen objects and matches them against candidate regions extracted from query images. In the embedding stage, it fuses class-token embeddings with patch embeddings normalized by generalized mean pooling (GeM); in the matching stage, it applies a novel joint similarity metric that integrates both absolute and relative similarities, and it calibrates candidate-region reliability with an uncertainty-aware object prior. Evaluated on the BOP Challenge 2025, MUSE achieves state-of-the-art zero-shot detection and segmentation performance, securing first place across all tracks: Classic Core, H3, and Industrial.
📝 Abstract
In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from the 3D models of unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, normalizing the patch embeddings with generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, improving the robustness of matching in challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE is a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
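The abstract does not give the exact formulas, but the two core operations it names have standard forms that can be sketched. Below is a minimal NumPy illustration under stated assumptions: GeM pooling follows its usual definition (mean of p-th powers, then the 1/p root; p=1 recovers average pooling), while `joint_similarity` is one plausible reading of the absolute/relative combination, where the absolute score is the best cosine similarity to the target object's templates and the relative score normalizes it against the best match over all objects' templates. The blending weight `alpha`, the relative normalization, and the function names are assumptions, not the paper's actual method.

```python
import numpy as np

def gem_pool(patch_embeddings, p=3.0, eps=1e-6):
    """Generalized mean pooling over patch embeddings of shape (N_patches, D).

    Standard GeM: (mean(x^p))^(1/p); p=1 gives average pooling and large p
    approaches max pooling. Activations are clipped to stay positive.
    """
    x = np.clip(patch_embeddings, eps, None)
    return np.mean(x ** p, axis=0) ** (1.0 / p)

def l2_normalize(v, eps=1e-12):
    """Unit-normalize a vector so dot products become cosine similarities."""
    return v / (np.linalg.norm(v) + eps)

def joint_similarity(proposal_emb, target_templates, all_templates, alpha=0.5):
    """Hypothetical joint score (assumed form, not from the paper).

    Absolute term: best cosine similarity of the proposal to the target
    object's templates. Relative term: that score normalized by the best
    similarity over every object's templates, so a proposal that matches
    the target better than any distractor scores near 1.
    """
    s_abs = (target_templates @ proposal_emb).max()
    s_all = (all_templates @ proposal_emb).max()
    s_rel = s_abs / (s_all + 1e-12)
    return alpha * s_abs + (1 - alpha) * s_rel
```

A proposal descriptor would then be built by fusing the class token with the GeM-pooled patches, e.g. `l2_normalize(np.concatenate([cls_token, gem_pool(patches)]))`, before scoring it against each object's template descriptors.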