Medical Referring Image Segmentation via Next-Token Mask Prediction

📅 2025-11-07
🤖 AI Summary
To reduce the architectural complexity of medical referring image segmentation (MRIS), where recent methods rely on dedicated multimodal fusion modules and multi-stage decoders, this paper proposes NTP-MRISeg, a unified autoregressive framework that casts referring-expression-guided segmentation as next-token prediction over a single multimodal token sequence. It introduces three strategies: Next-k Token Prediction (NkTP) to reduce the cumulative prediction errors caused by exposure bias, Token-level Contrastive Learning (TCL) to sharpen boundary sensitivity and mitigate long-tail token distributions, and a memory-based Hard Error Token (HET) optimization strategy that emphasizes difficult tokens during training. The framework uses a pretrained multimodal tokenizer to turn images, referring texts, and segmentation masks into one token sequence, which removes the need for modality-specific fusion modules and external segmentation heads. On QaTa-COV19 and MosMedData+, NTP-MRISeg is reported to achieve state-of-the-art performance.
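The unified-sequence idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the vocabulary sizes, the ID offsets, the `BOS_MASK` marker, and both function names are assumptions made for the example. It shows image, text, and mask token IDs being placed in one shared ID space, with the loss restricted to mask positions.

```python
# Hypothetical vocabulary layout: each modality gets its own ID range.
# All sizes below are assumptions for illustration only.
IMAGE_VOCAB = 8192
TEXT_VOCAB = 32000
MASK_VOCAB = 8192

IMAGE_OFFSET = 0
TEXT_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
MASK_OFFSET = TEXT_OFFSET + TEXT_VOCAB
BOS_MASK = MASK_OFFSET + MASK_VOCAB  # hypothetical "start of mask" marker


def build_sequence(image_ids, text_ids, mask_ids):
    """Concatenate image, text, and mask tokens into one ID sequence.

    Returns (sequence, loss_mask). loss_mask[i] is True only where the
    sequence holds a mask token, i.e. a token the model must predict.
    """
    seq = [IMAGE_OFFSET + i for i in image_ids]
    seq += [TEXT_OFFSET + t for t in text_ids]
    seq.append(BOS_MASK)
    prefix_len = len(seq)
    seq += [MASK_OFFSET + m for m in mask_ids]
    loss_mask = [False] * prefix_len + [True] * len(mask_ids)
    return seq, loss_mask


def next_token_pairs(seq, loss_mask):
    """Return (context, target) pairs for the supervised positions only."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq)) if loss_mask[i]]


seq, loss_mask = build_sequence(image_ids=[5, 17, 3], text_ids=[101, 7], mask_ids=[0, 1, 1])
pairs = next_token_pairs(seq, loss_mask)
```

Because every modality lives in the same sequence, a single decoder can attend from mask positions back to image and text positions, which is what makes a separate fusion module unnecessary in this formulation.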

📝 Abstract
Medical Referring Image Segmentation (MRIS) involves segmenting target regions in medical images based on natural language descriptions. While achieving promising results, recent approaches usually involve complex designs for multimodal fusion or multi-stage decoders. In this work, we propose NTP-MRISeg, a novel framework that reformulates MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This formulation streamlines model design by eliminating the need for modality-specific fusion and external segmentation models, and it supports a unified architecture for end-to-end training. It also enables the use of pretrained tokenizers from emerging large-scale multimodal models, enhancing generalization and adaptability. More importantly, to address challenges under this formulation (such as exposure bias, long-tail token distributions, and fine-grained lesion edges), we propose three novel strategies: (1) a Next-k Token Prediction (NkTP) scheme to reduce cumulative prediction errors, (2) Token-level Contrastive Learning (TCL) to enhance boundary sensitivity and mitigate long-tail distribution effects, and (3) a memory-based Hard Error Token (HET) optimization strategy that emphasizes difficult tokens during training. Extensive experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate that NTP-MRISeg achieves new state-of-the-art performance, offering a streamlined and effective alternative to traditional MRIS pipelines.
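The abstract does not say how the HET memory is keyed or how it enters the loss, so the following is only one plausible reading, sketched under stated assumptions: a hypothetical `HardTokenMemory` keeps an exponential moving average of the error rate at each mask-token position, and a hypothetical `weighted_loss` up-weights positions that have been wrong often. The `momentum` and `boost` parameters are illustrative, not taken from the paper.

```python
class HardTokenMemory:
    """Track a running error rate per token position (assumed design)."""

    def __init__(self, num_positions, momentum=0.9, boost=2.0):
        self.error_rate = [0.0] * num_positions
        self.momentum = momentum  # how slowly old errors are forgotten
        self.boost = boost        # extra weight given to a position that is always wrong

    def update(self, predicted, target):
        """Blend the latest hit/miss outcome into each position's error rate."""
        for i, (p, t) in enumerate(zip(predicted, target)):
            wrong = 1.0 if p != t else 0.0
            self.error_rate[i] = (
                self.momentum * self.error_rate[i] + (1.0 - self.momentum) * wrong
            )

    def weights(self):
        """Loss weight per position: 1.0 for easy tokens, larger for hard ones."""
        return [1.0 + self.boost * e for e in self.error_rate]


def weighted_loss(per_token_losses, weights):
    """Weighted mean of per-token losses, normalised by the total weight."""
    total = sum(w * l for w, l in zip(weights, per_token_losses))
    return total / sum(weights)


memory = HardTokenMemory(num_positions=4, momentum=0.5, boost=2.0)
memory.update(predicted=[1, 0, 1, 1], target=[1, 1, 1, 0])
token_weights = memory.weights()
losses = [0.1, 1.0, 0.1, 1.0]
hard_weighted = weighted_loss(losses, token_weights)
plain_mean = sum(losses) / len(losses)
```

In this toy run the two mispredicted positions receive twice the weight of the correct ones, so the weighted loss is pulled toward the difficult tokens relative to the plain mean.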
Problem

Research questions and friction points this paper is trying to address.

Streamlining multimodal fusion in medical image segmentation
Addressing exposure bias in autoregressive mask prediction
Enhancing boundary sensitivity for fine-grained lesion segmentation
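For the exposure-bias item in the list above, a common remedy is to supervise each position on several future tokens rather than only the next one. The sketch below shows a generic multi-offset objective of that kind; it is an assumption about what a next-k scheme could look like, not the paper's exact NkTP formulation, and the names `next_k_targets`, `next_k_loss`, and the `head_logits` layout are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_k_targets(tokens, k):
    """For each position t, list the up-to-k tokens that follow it."""
    return [tokens[t + 1 : t + 1 + k] for t in range(len(tokens) - 1)]


def next_k_loss(head_logits, tokens, k):
    """Average cross-entropy over all (position, offset) pairs.

    head_logits[t][j] holds the logits that position t assigns to the
    token at offset j + 1, i.e. tokens[t + 1 + j]. Positions near the end
    of the sequence simply have fewer targets.
    """
    total, count = 0.0, 0
    for t, targets in enumerate(next_k_targets(tokens, k)):
        for j, target in enumerate(targets):
            total -= log_softmax(head_logits[t][j])[target]
            count += 1
    return total / count


tokens = [0, 1, 1, 0]
targets = next_k_targets(tokens, k=2)
# Uniform logits over a 2-token vocabulary, two offsets per position.
uniform_logits = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(len(tokens) - 1)]
loss = next_k_loss(uniform_logits, tokens, k=2)
```

With `k=1` this reduces to the ordinary next-token cross-entropy; larger `k` forces each position to carry information about tokens further ahead, which is the intuition behind reducing cumulative errors.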
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive next-token prediction for unified multimodal sequences
Next-k token prediction to reduce cumulative errors
Token-level contrastive learning for boundary sensitivity enhancement
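Token-level contrastive learning can be illustrated with a supervised InfoNCE loss over token embeddings, where tokens sharing a label (for example, lesion versus background) are treated as positives. This is a generic sketch under that assumption; the source does not specify how positives and negatives are chosen, and `token_contrastive_loss` and its `temperature` default are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def token_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised InfoNCE over tokens: same-label tokens are positives."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor needs at least one positive
        # Exponentiated similarity of the anchor to every other token.
        sims = {
            j: math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            for j in range(n)
            if j != i
        }
        denom = sum(sims.values())
        total -= sum(math.log(sims[j] / denom) for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = token_contrastive_loss(embeddings, labels=[0, 0, 1, 1])
mixed = token_contrastive_loss(embeddings, labels=[0, 1, 0, 1])
```

The loss is small when same-label tokens already cluster together (`aligned`) and large when labels cut across the clusters (`mixed`), so minimising it pushes embeddings of different classes apart, which matters most for tokens near a lesion boundary.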
Authors

Xinyu Chen
School of Electrical and Computer Engineering, University of Sydney, Sydney, NSW 2006, Australia

Yiran Wang
School of Electrical and Computer Engineering, University of Sydney, Sydney, NSW 2006, Australia

Gaoyang Pang
School of Electrical and Computer Engineering, University of Sydney, Sydney, NSW 2006, Australia

Jiafu Hao
School of Electrical and Computer Engineering, University of Sydney, Sydney, NSW 2006, Australia

Chentao Yue
The University of Sydney
Coding Theory, Information Theory

Luping Zhou
School of Electrical and Computer Engineering, University of Sydney
Medical Imaging, Computer Vision, Machine Learning

Yonghui Li
The University of Sydney
Wireless communications, Channel coding, Internet of Things, Signal Processing, Game theory