Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

📅 2025-08-06
🤖 AI Summary
This work addresses zero-shot instance segmentation: pixel-level segmentation of novel objects without retraining on target data. It proposes OC-DiT, an object-centric framework built on a conditional latent diffusion model. OC-DiT conditions the generative process on visual object descriptors (object templates) and local image features within the model's latent space. A coarse model generates initial instance proposals, and a refinement model then refines all proposals in parallel. Both are trained on a newly created, large-scale synthetic dataset built from thousands of high-quality object meshes. OC-DiT achieves state-of-the-art performance on multiple challenging real-world benchmarks without retraining on target data, and ablation studies support the potential of diffusion models for instance segmentation.

📝 Abstract
This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.
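
The conditioning idea in the abstract can be illustrated with a minimal sketch of a conditional reverse-diffusion loop. Everything here is an assumption made for illustration and is not taken from the paper: the deterministic DDIM-style update, the linear noise schedule, the array shapes, and the names `linear_alpha_bars`, `ddim_sample`, and `make_toy_denoiser`. The toy denoiser stands in for a trained network; it simply treats a template-to-feature similarity map as the clean latent, so the loop has something well defined to converge to.

```python
import numpy as np

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal fraction alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_sample(denoiser, cond, shape, alpha_bars, rng):
    """Deterministic DDIM-style sampling, from pure noise down to t = 0.

    `denoiser(z, t, cond)` must return the predicted noise in `z`.
    """
    z = rng.standard_normal(shape)
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = denoiser(z, t, cond)
        # Estimate the clean latent, then re-noise it to the previous level.
        x0 = (z - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        z = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
    return z

def make_toy_denoiser(alpha_bars):
    """Hypothetical stand-in for a trained, conditioned noise-prediction network."""
    def denoiser(z, t, cond):
        templates, image_feats = cond  # (K, C) descriptors, (C, H, W) features
        # Pretend the clean latent is a per-object similarity map (assumption).
        target = np.tanh(np.einsum("kc,chw->khw", templates, image_feats))
        ab_t = alpha_bars[t]
        return (z - np.sqrt(ab_t) * target) / np.sqrt(1.0 - ab_t)
    return denoiser

rng = np.random.default_rng(0)
templates = rng.standard_normal((3, 8))         # 3 objects, 8-dim descriptors
image_feats = rng.standard_normal((8, 16, 16))  # 8-channel feature map
alpha_bars = linear_alpha_bars(50)

latent = ddim_sample(make_toy_denoiser(alpha_bars),
                     (templates, image_feats), (3, 16, 16), alpha_bars, rng)
# A real latent diffusion model would decode the latent with a learned decoder;
# thresholding it directly is a simplification for this sketch.
masks = latent > 0.0
```

The point of the sketch is only that the conditioning (object descriptors and image features) enters every denoising step, and that one latent channel per object yields one mask per object.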
Problem

Research questions and friction points this paper is trying to address.

Develops diffusion models for zero-shot instance segmentation
Generates instance masks using object templates and image features
Achieves state-of-the-art performance without target data retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional latent diffusion for instance masks
Coarse and refinement model variants
Large-scale synthetic dataset training
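
One plausible way to hand coarse proposals to a refinement stage that processes them in parallel is to crop the image around each proposal and stack the crops into a single batch. The sketch below assumes this crop-and-batch design; the source does not describe how the refinement input is built, and `mask_to_box`, `resize_nearest`, `build_refinement_batch`, the padding, and the crop size are hypothetical.

```python
import numpy as np

def mask_to_box(mask, pad=2):
    """Padded bounding box (y0, y1, x0, x1) of a boolean mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    h, w = mask.shape
    return (max(int(ys.min()) - pad, 0), min(int(ys.max()) + 1 + pad, h),
            max(int(xs.min()) - pad, 0), min(int(xs.max()) + 1 + pad, w))

def resize_nearest(img, size):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def build_refinement_batch(image, coarse_masks, size=32):
    """Crop the image around each non-empty coarse mask and stack the crops.

    Returns the (N, size, size, C) batch and the boxes needed to paste
    refined masks back into the full image.
    """
    crops, boxes = [], []
    for mask in coarse_masks:
        box = mask_to_box(mask)
        if box is None:  # skip proposals with no foreground pixels
            continue
        y0, y1, x0, x1 = box
        crops.append(resize_nearest(image[y0:y1, x0:x1], size))
        boxes.append(box)
    if not crops:
        return np.empty((0, size, size, image.shape[2]), dtype=image.dtype), boxes
    return np.stack(crops), boxes

image = np.random.default_rng(0).random((64, 64, 3))
coarse_masks = np.zeros((3, 64, 64), dtype=bool)
coarse_masks[0, 10:20, 30:40] = True  # first proposal
coarse_masks[2, 0:5, 0:5] = True      # third proposal; the second stays empty

batch, boxes = build_refinement_batch(image, coarse_masks)
```

With every proposal resized to the same shape, a refinement model could process the whole batch in one forward pass instead of looping over objects.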