🤖 AI Summary
Existing multimodal large language models (MLLMs) for image segmentation predominantly rely on boundary points or task-specific segmentation heads, limiting their capacity to model fine-grained pixel-level structures.
Method: We propose the first MLLM-based segmentation framework that integrates autoregressive image generation: images are tokenized via a VQ-VAE, and the MLLM directly generates visual token sequences, which a generic decoder reconstructs into dense segmentation masks, eliminating discrete prompts and task-specific heads. A next-scale-prediction strategy generates the required visual tokens in parallel to reduce inference latency.
Contribution/Results: Our approach unifies pixel-level perception with multimodal semantic understanding. It surpasses prior state-of-the-art methods on multiple segmentation benchmarks and substantially speeds up inference, while maintaining strong understanding capabilities without sacrificing mask fidelity.
📝 Abstract
We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works that integrate image segmentation into multimodal large language models (MLLMs) typically employ either boundary-point representations or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We use the MLLM to output visual tokens and detokenize them into images with a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy that generates the required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
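To give a rough sense of why next-scale prediction can reduce latency, the sketch below counts sequential decoding steps under two schemes. It is only an illustration under stated assumptions: the coarse-to-fine schedule `SCALES`, the square token maps, and the helper names are hypothetical and are not taken from the paper. It assumes that next-token prediction needs one forward pass per token of a single finest-resolution map, while next-scale prediction emits all tokens of one scale in a single pass.

```python
def tokens_per_scale(side_lengths):
    """Token count of each square token map, given its side length."""
    return [side * side for side in side_lengths]


def decoding_steps(side_lengths):
    """Compare sequential forward passes for the two decoding schemes.

    Assumptions (illustrative only):
      - next-token prediction decodes one finest-resolution map,
        one token per forward pass;
      - next-scale prediction decodes every scale in the schedule,
        with all tokens of a scale produced in one forward pass.
    """
    counts = tokens_per_scale(side_lengths)
    return {
        "multi_scale_tokens": sum(counts),      # tokens emitted across all scales
        "next_token_steps": counts[-1],         # one pass per finest-scale token
        "next_scale_steps": len(side_lengths),  # one pass per scale
    }


# Hypothetical coarse-to-fine schedule of token-map side lengths.
SCALES = [1, 2, 4, 8, 16]
stats = decoding_steps(SCALES)
print(stats)
```

Under this hypothetical schedule the multi-scale scheme emits more tokens in total, but the number of sequential passes drops from the finest map's token count to the number of scales, which is the kind of trade-off that parallel token generation relies on.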