Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) rely on textualized visual outputs (e.g., generating coordinates as text), which limits performance and rules out dense prediction tasks such as segmentation. To address this, the authors propose Patch-as-Decodable-Token (PaDT), a framework that unifies detection, segmentation, and referring expression comprehension within a single MLLM architecture. PaDT introduces learnable, decodable Visual Reference Tokens (VRTs) that map image patch embeddings into visual tokens processed alongside text tokens. A lightweight decoder transforms the LLM's hidden states directly into pixel-level predictions, while a per-token cross-entropy loss and stochastic VRT sampling enable efficient training. Evaluated on four mainstream visual perception benchmarks, PaDT achieves state-of-the-art performance, outperforming significantly larger MLLMs, and demonstrates that dense visual prediction is feasible without sacrificing language modeling capabilities. This work points toward end-to-end, unified multimodal perception in large language models.

📝 Abstract
Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM's output textual tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks show that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
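The core idea in the abstract can be made concrete with a small sketch: per image, each visual patch gets a fresh token id appended beyond the text vocabulary (the "dynamically expanded embedding table"), so the LLM can emit VRTs interleaved with ordinary text tokens, and a downstream decoder routes the VRT positions to box/mask prediction. All names, ids, and vocabulary sizes below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the PaDT decoding idea: visual patches become
# Visual Reference Tokens (VRTs) added to the vocabulary for this forward
# pass only, interleaved with text tokens in the model's output.
# Toy vocabulary; a real model has tens of thousands of text tokens.
TEXT_VOCAB = {0: "<bos>", 1: "the", 2: "dog", 3: "<eos>"}

def expand_vocab_with_vrts(num_patches):
    """Per-image: assign each visual patch a fresh token id beyond the
    text vocabulary (the dynamically expanded embedding table)."""
    base = len(TEXT_VOCAB)
    return {base + i: f"<vrt_{i}>" for i in range(num_patches)}

def decode_sequence(token_ids, vrt_vocab):
    """Split an interleaved output sequence into a text stream and a VRT
    stream; a lightweight decoder would turn the VRT positions into
    pixel-level predictions (boxes, masks)."""
    text, vrts = [], []
    for t in token_ids:
        if t in vrt_vocab:
            vrts.append(vrt_vocab[t])   # routed to the visual decoder
        else:
            text.append(TEXT_VOCAB[t])  # ordinary language output
    return text, vrts

vrt_vocab = expand_vocab_with_vrts(num_patches=4)
# e.g. the model grounds "the dog" in patches 1 and 2:
text, vrts = decode_sequence([0, 1, 2, 5, 6, 3], vrt_vocab)
print(text)  # ['<bos>', 'the', 'dog', '<eos>']
print(vrts)  # ['<vrt_1>', '<vrt_2>']
```

Because the VRT ids are regenerated at every forward pass, two similar objects in different images never share a token, which is one way to read the abstract's claim about improved differentiation among similar objects.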
Problem

Research questions and friction points this paper is trying to address.

Enables direct generation of visual outputs in multimodal language models
Overcomes limitations of indirect representations for dense prediction tasks
Unifies detection, segmentation and grounding through visual reference tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

PaDT enables direct generation of visual outputs in MLLMs
VRTs interleave visual patches with textual tokens dynamically
Lightweight decoder transforms outputs into detection and segmentation
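The training strategy mentioned above supervises every output position, text and VRT alike, with a standard per-token cross-entropy. A minimal numerically stable version, with toy logits and targets (not the paper's code), looks like this:

```python
import math

def per_token_cross_entropy(logits_per_pos, target_ids):
    """Mean of -log softmax(logits)[target] over all output positions,
    where some target ids may be text tokens and others VRT ids."""
    total = 0.0
    for logits, tgt in zip(logits_per_pos, target_ids):
        m = max(logits)  # subtract the max to stabilise the softmax
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[tgt]  # -log p(target | position)
    return total / len(target_ids)

# Two positions over a toy vocabulary of 3 (e.g. two text ids + one VRT id)
loss = per_token_cross_entropy([[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]], [0, 2])
```

Randomly sampling which VRTs appear in the supervision targets (the stochastic VRT selection the abstract describes) would simply change which `target_ids` this loss is evaluated on.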
Yongyi Su
South China University of Technology
Computer Vision, Machine Learning, Test-Time Adaptation
Haojie Zhang
WeChat Vision, Tencent Inc.
Shijie Li
Institute for Infocomm Research (I2R), A*STAR
Nanqing Liu
Southwest Jiaotong University
Remote Sensing, Deep Learning, Object Detection
Jingyi Liao
Institute for Infocomm Research (I2R), A*STAR
Junyi Pan
WeChat Vision, Tencent Inc.
Yuan Liu
South China University of Technology
Xiaofen Xing
South China University of Technology
Chong Sun
Tencent WeChat
Computer Vision
Chen Li
WeChat Vision, Tencent Inc.
Nancy F. Chen
ISCA Fellow, AAIA Fellow, Multimodal Generative AI Group Leader, AI for Education Head at A*STAR
Agentic AI, Large Language Models, Conversational AI
Shuicheng Yan
National University of Singapore
Xulei Yang
Principal Scientist & Group Leader, A*STAR, Singapore
3D Vision, Artificial Intelligence, Medical Imaging
Xun Xu
Institute for Infocomm Research (I2R), A*STAR