SAMTok: Representing Any Mask with Two Words

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling multimodal large language models (MLLMs) to efficiently support pixel-level tasks, which has been hindered by complex region encoders, task-specific decoders, and incompatible training objectives. The authors propose SAMTok, a novel approach that compresses arbitrary region masks into just two discrete language tokens, allowing standard MLLMs to learn pixel-level understanding and generation through conventional language modeling—without architectural modifications or custom loss functions. Built upon SAM2, SAMTok integrates a mask encoder with a residual vector quantizer and is trained on 209 million masks, further refined via reinforcement learning with a text-mask alignment reward. An implementation based on Qwen-VL achieves state-of-the-art or competitive performance on region captioning, region-based VQA, and referring segmentation, while reinforcement learning delivers substantial additional gains on the GRES and GCG benchmarks.
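The summary's key mechanism is a residual vector quantizer: each stage snaps the current residual to its nearest codebook entry and subtracts it, and the chosen indices become the discrete tokens. The page does not reproduce SAMTok's actual codebooks or dimensions, so the sketch below uses toy sizes and plain Python purely to illustrate how two quantizer stages yield two tokens per mask embedding:

```python
import math
import random

def residual_vector_quantize(embedding, codebooks):
    """Quantize an embedding with a cascade of codebooks (RVQ sketch).

    Each stage picks the code nearest to the current residual and
    subtracts it; the chosen indices become the discrete tokens.
    """
    residual = list(embedding)
    indices = []
    for codebook in codebooks:
        # Nearest code to the current residual (Euclidean distance).
        idx = min(
            range(len(codebook)),
            key=lambda i: math.dist(codebook[i], residual),
        )
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    # Reconstruction = sum of chosen codes = embedding - final residual.
    recon = [e - r for e, r in zip(embedding, residual)]
    return indices, recon

# Two quantizer stages -> two discrete tokens per mask embedding.
# dim and codebook_size are toy values, not the paper's.
random.seed(0)
dim, codebook_size = 8, 16
codebooks = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
    for _ in range(2)
]
mask_embedding = [random.gauss(0, 1) for _ in range(dim)]
tokens, recon = residual_vector_quantize(mask_embedding, codebooks)
print(len(tokens))  # -> 2
```

In the real system the reconstruction would be decoded back into a pixel mask by the SAM2-based decoder; here it is just the sum of the selected codes.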

📝 Abstract
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
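The abstract's "textual answer-matching reward" is not specified on this page. A minimal sketch, assuming the simplest plausible form (binary exact match after whitespace and case normalization; the paper's actual reward may be softer or mask-aware), shows why such a signal is cheap to compute during reinforcement learning:

```python
def answer_match_reward(predicted_answer: str, reference_answer: str) -> float:
    """Hypothetical textual answer-matching reward.

    Returns 1.0 when the model's textual answer matches the reference
    after light normalization, else 0.0. This is an illustrative
    assumption, not the paper's exact formulation.
    """
    def norm(s: str) -> str:
        # Lowercase and collapse runs of whitespace.
        return " ".join(s.lower().strip().split())

    return 1.0 if norm(predicted_answer) == norm(reference_answer) else 0.0

print(answer_match_reward("The  Cat", "the cat"))  # -> 1.0
```

Because the reward depends only on string comparison, no segmentation decoder needs to run inside the RL loop, which is what makes the training efficient.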
Problem

Research questions and friction points this paper is trying to address.

pixel-wise capabilities
multimodal LLMs
mask representation
scalable segmentation
interactive intelligent systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

mask tokenization
pixel-wise MLLM
discrete representation
reinforcement learning
next-token prediction