🤖 AI Summary
This work addresses the challenge of integrating image segmentation with multimodal large language models (MLLMs). We propose a novel “text-as-mask” paradigm that reformulates segmentation as a text generation task, eliminating conventional segmentation decoders. Instead, we employ lightweight semantic descriptors over a 16×16 patch grid, mapping each image patch to a structured textual label. To enhance efficiency, we introduce row-wise run-length encoding (R-RLE), reducing token sequence length by 74% and accelerating inference threefold. End-to-end fine-tuning of MLLMs enables joint visual–linguistic modeling, achieving state-of-the-art performance on referring expression segmentation and comprehension benchmarks. Our approach significantly reduces computational overhead, improves model scalability and cross-task generalization, and establishes a practical, efficient pathway toward native pixel-level understanding in MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16\times16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.
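To make the core idea concrete, here is a minimal sketch of row-wise run-length encoding over a grid of per-patch text labels. The exact token format used in the paper is not specified here, so the `label*count` notation, the helper names, and the toy 4×4 grid are all illustrative assumptions; the point is only that encoding each row independently compresses redundant runs while remaining exactly invertible.

```python
from itertools import groupby

def r_rle_encode(rows):
    """Row-wise RLE: compress runs of identical labels within each row
    independently, so row boundaries survive for lossless decoding."""
    encoded = []
    for row in rows:
        runs = [(label, sum(1 for _ in group)) for label, group in groupby(row)]
        encoded.append([f"{label}*{n}" if n > 1 else label for label, n in runs])
    return encoded

def r_rle_decode(encoded):
    """Invert the encoding back to the full per-patch label grid."""
    rows = []
    for row in encoded:
        out = []
        for tok in row:
            label, _, n = tok.partition("*")
            out.extend([label] * (int(n) if n else 1))
        rows.append(out)
    return rows

# Toy 4x4 grid of semantic descriptors (the paper uses a 16x16 grid).
grid = [
    ["sky",   "sky",   "sky",   "sky"],
    ["sky",   "dog",   "dog",   "sky"],
    ["grass", "dog",   "dog",   "grass"],
    ["grass", "grass", "grass", "grass"],
]
compressed = r_rle_encode(grid)
assert r_rle_decode(compressed) == grid      # lossless round trip
flat_tokens = sum(len(r) for r in grid)      # 16 label tokens
rle_tokens = sum(len(r) for r in compressed) # 8 tokens on this toy grid
```

Because natural masks are spatially coherent, long runs of identical labels are common, which is why the paper reports a 74% reduction in sequence length; the savings on this toy grid (16 → 8 tokens) illustrate the same effect at small scale.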