Text4Seg++: Advancing Image Segmentation via Generative Language Modeling

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the complexity and reliance on auxiliary decoders when integrating image segmentation into multimodal large language models (MLLMs). We propose a novel “text-as-mask” paradigm: segmentation masks are encoded as lightweight semantic descriptors—spanning both image-level and bounding-box-level representations—and further compressed via a semantic brick mechanism and row-wise run-length encoding (R-RLE). Structured segmentation text is generated end-to-end through standard language modeling, eliminating the need for task-specific decoders. Our approach is fully compatible with mainstream MLLM backbones and requires no segmentation-specific fine-tuning. Evaluated on diverse natural and remote-sensing image benchmarks, it surpasses state-of-the-art methods in accuracy while achieving a 3× inference speedup and 74% text compression ratio. The method demonstrates superior accuracy, efficiency, and scalability.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
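The core idea of R-RLE — compressing each row of patch labels into runs before emitting them as text — can be sketched as follows. This is a minimal illustration, assuming a simple `(label, run_length)` representation; the function names and the paper's exact token format (e.g. how runs are serialized into the generated text) are assumptions, not the authors' implementation.

```python
def encode_descriptors(rows):
    """Compress each row of patch labels into (label, run_length) pairs.

    `rows` is a grid of per-patch text labels (image-wise semantic
    descriptors); repeated labels within a row collapse into one run.
    """
    encoded = []
    for row in rows:
        runs = []
        for label in row:
            if runs and runs[-1][0] == label:
                runs[-1][1] += 1  # extend the current run
            else:
                runs.append([label, 1])  # start a new run
        encoded.append(runs)
    return encoded


def decode_descriptors(encoded):
    """Expand row-wise runs back into the full patch-label grid."""
    return [
        [label for label, n in runs for _ in range(n)]
        for runs in encoded
    ]


# A 2x4 patch grid with large uniform regions compresses well:
grid = [["sky", "sky", "sky", "tree"],
        ["tree", "tree", "road", "road"]]
runs = encode_descriptors(grid)
# runs == [[["sky", 3], ["tree", 1]], [["tree", 2], ["road", 2]]]
assert decode_descriptors(runs) == grid
```

Because natural segmentation masks contain long horizontal runs of a single class, row-wise runs shorten the generated sequence substantially (the paper reports a 74% reduction) while remaining losslessly decodable.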
Problem

Research questions and friction points this paper is trying to address.

Integrating image segmentation into multimodal language models
Eliminating need for additional decoders for segmentation
Improving segmentation precision and efficiency through text representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-as-mask paradigm for segmentation
Semantic descriptors represent masks textually
Row-wise run-length encoding (R-RLE) compresses mask text and accelerates inference