ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) for image segmentation predominantly rely on boundary points or task-specific segmentation heads, limiting their capacity to model fine-grained pixel-level structures. Method: We propose the first MLLM-based segmentation framework that integrates autoregressive image generation: images are tokenized via VQ-VAE, and the MLLM directly generates visual token sequences, which a generic decoder reconstructs into dense segmentation masks—eliminating discrete prompts and task-specific heads. A novel next-scale-prediction strategy is introduced to enhance token generation efficiency. Contribution/Results: Our approach unifies pixel-level reconstruction with multimodal semantic understanding. It achieves state-of-the-art performance across multiple segmentation benchmarks, significantly accelerates inference speed, and maintains strong semantic consistency without sacrificing mask fidelity.

📝 Abstract
We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary-point representations or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We leverage the MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate the required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
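The segmentation-as-generation pipeline described above can be sketched as follows. This is a toy illustration under assumed shapes, not the paper's architecture: the "MLLM output" is a random token-id sequence, and `detokenize` stands in for the VQ-VAE decoder that maps a discrete token grid back to a dense mask.

```python
import numpy as np

# Hypothetical sketch: an "MLLM" emits discrete visual token ids, and a
# VQ-VAE-style decoder maps the id grid back to a dense segmentation mask.
# All sizes and names here are illustrative assumptions.

rng = np.random.default_rng(0)

CODEBOOK_SIZE, EMBED_DIM, GRID = 16, 8, 4  # toy codebook and token-grid sizes
codebook = rng.standard_normal((CODEBOOK_SIZE, EMBED_DIM))

def detokenize(token_ids: np.ndarray) -> np.ndarray:
    """Map a (GRID*GRID,) sequence of token ids to a dense (GRID, GRID) binary mask."""
    # Look up codebook embeddings and arrange them on the spatial grid.
    embeds = codebook[token_ids].reshape(GRID, GRID, EMBED_DIM)
    # Stand-in for the VQ-VAE decoder: collapse embeddings to per-pixel
    # logits, then threshold to a binary mask.
    logits = embeds.sum(axis=-1)
    return (logits > 0).astype(np.uint8)

token_ids = rng.integers(0, CODEBOOK_SIZE, size=GRID * GRID)  # "MLLM output"
mask = detokenize(token_ids)
print(mask.shape)  # (4, 4)
```

Because the mask is reconstructed from the same visual tokens the MLLM generates, no task-specific segmentation head is needed; any decoder matched to the codebook suffices.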
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and pixel-level perception in image segmentation
Overcoming limitations of discrete representations in segmentation methods
Reducing inference latency while maintaining strong visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive generation paradigm for image segmentation
Visual tokens detokenized into images via VQ-VAE
Next-scale-prediction strategy reduces inference latency
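The latency benefit of next-scale prediction comes from emitting all tokens of a scale in one step instead of one token at a time. The scale schedule below is an illustrative assumption, not the paper's configuration; it only shows the step-count arithmetic.

```python
# Hedged sketch of the step-count argument: plain next-token decoding needs one
# autoregressive step per visual token, while next-scale prediction emits every
# token of a scale in parallel, so it needs one step per scale.
# The scale side lengths below are illustrative, not the paper's settings.

scales = [1, 2, 4, 8, 16]                        # token-map side length per scale

next_token_steps = sum(s * s for s in scales)    # one step per token: 341
next_scale_steps = len(scales)                   # one step per scale: 5

print(next_token_steps, next_scale_steps)        # 341 5
```

Even in this toy schedule, the sequential step count drops from 341 to 5, which is the source of the reported inference speedup.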
Authors (Ant Group): Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, Jun Zhou