🤖 AI Summary
This work addresses the challenge of integrating image segmentation with multimodal large language models (MLLMs). We propose a novel “text-as-mask” paradigm that reformulates segmentation as a text generation task, eliminating conventional segmentation decoders. Instead, we employ lightweight semantic descriptors over a 16×16 patch grid, mapping each image patch to a structured textual label. To enhance efficiency, we introduce row-wise run-length encoding (R-RLE), reducing token sequence length by 74% and accelerating inference threefold. End-to-end fine-tuning of MLLMs enables joint visual–linguistic modeling, achieving state-of-the-art performance on referring expression segmentation and comprehension benchmarks. Our approach significantly reduces computational overhead, improves model scalability and cross-task generalization, and establishes a practical, efficient pathway toward native pixel-level understanding in MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16\times16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.
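To make the core idea concrete, here is a minimal sketch of row-wise run-length encoding over a grid of per-patch text labels. The exact token format used in the paper is not specified here, so the `label*count` notation, the helper names, and the toy 4×4 grid are all illustrative assumptions; the point is only that encoding each row independently compresses redundant runs while remaining exactly invertible.

```python
from itertools import groupby

def r_rle_encode(rows):
    """Row-wise RLE: compress runs of identical labels within each row
    independently, so row boundaries survive for lossless decoding."""
    encoded = []
    for row in rows:
        runs = [(label, sum(1 for _ in group)) for label, group in groupby(row)]
        encoded.append([f"{label}*{n}" if n > 1 else label for label, n in runs])
    return encoded

def r_rle_decode(encoded):
    """Invert the encoding back to the full per-patch label grid."""
    rows = []
    for row in encoded:
        out = []
        for tok in row:
            label, _, n = tok.partition("*")
            out.extend([label] * (int(n) if n else 1))
        rows.append(out)
    return rows

# Toy 4x4 grid of semantic descriptors (the paper uses a 16x16 grid).
grid = [
    ["sky",   "sky",   "sky",   "sky"],
    ["sky",   "dog",   "dog",   "sky"],
    ["grass", "dog",   "dog",   "grass"],
    ["grass", "grass", "grass", "grass"],
]
compressed = r_rle_encode(grid)
assert r_rle_decode(compressed) == grid      # lossless round trip
flat_tokens = sum(len(r) for r in grid)      # 16 label tokens
rle_tokens = sum(len(r) for r in compressed) # 8 tokens on this toy grid
```

Because natural masks are spatially coherent, long runs of identical labels are common, which is why the paper reports a 74% reduction in sequence length; the savings on this toy grid (16 → 8 tokens) illustrate the same effect at small scale.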