🤖 AI Summary
This work tackles the complexity of, and reliance on, auxiliary decoders that arise when integrating image segmentation into multimodal large language models (MLLMs). We propose a novel “text-as-mask” paradigm: segmentation masks are encoded as lightweight semantic descriptors—spanning both image-level and bounding-box-level representations—and further compressed via a semantic brick mechanism and row-wise run-length encoding (R-RLE). Structured segmentation text is generated end-to-end through standard language modeling, eliminating the need for task-specific decoders. Our approach is fully compatible with mainstream MLLM backbones and requires no segmentation-specific fine-tuning. Evaluated on diverse natural and remote-sensing image benchmarks, it surpasses state-of-the-art methods in accuracy while achieving a 3× inference speedup and a 74% reduction in semantic-descriptor text length. The method demonstrates superior accuracy, efficiency, and scalability.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks in which each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$ without compromising performance. Building upon this, our initial framework, Text4Seg, achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
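To make the R-RLE idea concrete, the sketch below compresses a grid of per-patch text labels row by row into (label, run-length) pairs, and decodes it losslessly. This is a minimal illustration under our own assumptions: the grid, the class names, and the tuple representation are hypothetical, not the paper's exact token format.

```python
# Minimal sketch of Row-wise Run-Length Encoding (R-RLE) over a grid of
# per-patch text labels. The (label, count) pair representation is a
# hypothetical illustration, not the paper's actual token scheme.
from itertools import groupby


def rrle_encode(label_grid):
    """Compress each row of patch labels into (label, run-length) pairs."""
    return [
        [(label, sum(1 for _ in run)) for label, run in groupby(row)]
        for row in label_grid
    ]


def rrle_decode(encoded):
    """Invert the encoding: expand each (label, count) pair back into a row."""
    return [
        [label for label, count in row for _ in range(count)]
        for row in encoded
    ]


# A tiny 2x8 patch-label grid ("bg" = background, "dog" = an object class).
grid = [
    ["bg", "bg", "bg", "dog", "dog", "dog", "dog", "bg"],
    ["bg", "dog", "dog", "dog", "dog", "dog", "bg", "bg"],
]
compact = rrle_encode(grid)
assert rrle_decode(compact) == grid  # round trip is lossless
```

Because neighbouring patches usually share a label, each row collapses to a handful of pairs, which is the source of the sequence-length reduction the abstract reports; encoding per row (rather than over the flattened grid) keeps the output aligned with image rows during generation.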