ZIM: Zero-Shot Image Matting for Anything

📅 2024-11-01
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
Existing zero-shot image segmentation models (e.g., SAM) yield coarse segmentation masks unsuitable for high-fidelity alpha matte estimation, while conventional matting methods rely on class-specific annotations and thus lack generalizability. To bridge this gap, we propose the first zero-shot image matting framework. Our approach comprises three key components: (1) an automatic label converter that transforms coarse segmentation masks into pixel-accurate alpha mattes; (2) SA1B-Matte, a large-scale matting dataset built from SA1B with this converter, requiring no manual annotation; and (3) a hierarchical pixel decoder and prompt-aware masked attention mechanism that enable fine-grained matte prediction atop the SAM architecture. Evaluated on the newly curated MicroMat-3K benchmark, our method significantly outperforms state-of-the-art approaches. Moreover, it demonstrates strong transferability to downstream tasks including image inpainting and 3D NeRF reconstruction. Code is publicly available.

📝 Abstract
The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code is available at https://github.com/naver-ai/ZIM.
Problem

Research questions and friction points this paper is trying to address.

Zero-shot segmentation models such as SAM produce coarse masks too imprecise for matting
Converting segmentation labels into detailed matte annotations without costly manual labeling
Focusing mask prediction on the regions specified by visual prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Label converter creates matte dataset automatically
Hierarchical pixel decoder enhances mask representation
Prompt-aware masked attention focuses on visual prompts
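The core idea behind the prompt-aware masked attention listed above can be sketched in a few lines: scores for positions outside a prompt-derived mask are set to −∞ before the softmax, so the attention weights concentrate on the prompted region. This is an illustrative toy in pure Python, not the authors' implementation; the function name `masked_attention` and the additive −∞ biasing scheme are assumptions based on standard masked-attention practice.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def masked_attention(query, keys, values, prompt_mask):
    """Toy single-query attention with a prompt-derived binary mask.

    Positions where prompt_mask is 0 receive a -inf score, so after the
    softmax they get zero weight and attention concentrates on the region
    indicated by the visual prompt (illustrative only; assumes at least
    one unmasked position).
    """
    scores = []
    for k, keep in zip(keys, prompt_mask):
        dot = sum(q * ki for q, ki in zip(query, k)) / math.sqrt(len(query))
        scores.append(dot if keep else float("-inf"))
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of value vectors.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

With the second position masked out, the output reduces to the first value vector, showing how the mask steers attention toward the prompted region.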