ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

📅 2025-10-23

📈 Citations: 0

✨ Influential: 0

career value

213K/year

🤖 AI Summary

Existing multimodal large language models (MLLMs) for image segmentation predominantly rely on boundary points or task-specific segmentation heads, limiting their capacity to model fine-grained pixel-level structures. Method: We propose the first MLLM-based segmentation framework that integrates autoregressive image generation: images are tokenized via VQ-VAE, and the MLLM directly generates visual token sequences, which a generic decoder reconstructs into dense segmentation masks—eliminating discrete prompts and task-specific heads. A novel next-scale-prediction strategy is introduced to enhance token generation efficiency. Contribution/Results: Our approach unifies pixel-level reconstruction with multimodal semantic understanding. It achieves state-of-the-art performance across multiple segmentation benchmarks, significantly accelerates inference speed, and maintains strong semantic consistency without sacrificing mask fidelity.

Technology Category

Application Category

📝 Abstract

We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using an universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.

Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and pixel-level perception in image segmentation

Overcoming limitations of discrete representations in segmentation methods

Reducing inference latency while maintaining strong visual understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive generation paradigm for image segmentation

Visual tokens detokenized into images via VQ-VAE

Next-scale-prediction strategy reduces inference latency

🔎 Similar Papers

No similar papers found.