Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing universal medical image segmentation methods rely on manual prompts or reference images and suffer from significant domain shifts in cross-modality scenarios, limiting their automation and robustness. This work proposes Concept-to-Pixel (C2P), a prompt-free framework for universal medical segmentation. C2P decouples anatomical knowledge into geometric and semantic representations: it distills high-level medical concepts from a multimodal large language model into learnable semantic tokens and introduces explicitly supervised geometric tokens to encode physical and structural constraints. The tokens interact with image features to generate input-specific dynamic kernels for mask prediction, and a Geometry-Aware Inference Consensus mechanism uses the predicted geometric constraints to assess prediction reliability and suppress outliers. Evaluated on a unified benchmark of eight datasets across seven imaging modalities, the jointly trained C2P outperforms the single-task and universal models it is compared against, and generalizes well in both zero-shot and cross-modality transfer settings.

📝 Abstract
Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model's predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach over both universal and single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks. Code is available at: https://github.com/Yundi218/Concept-to-Pixel
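The abstract says that tokens interact with image features to generate input-specific dynamic kernels for mask prediction. As a rough illustration of that general idea only, and not the authors' implementation (which lives in the linked repository), the NumPy sketch below projects each token to a 1x1 convolution kernel and applies it to a feature map to obtain one mask per token. The function name `dynamic_kernel_masks`, the projection matrix `w_proj`, and all shapes are assumptions made for this example.

```python
import numpy as np

def dynamic_kernel_masks(tokens, features, w_proj):
    """Hypothetical sketch of dynamic-kernel mask prediction.

    tokens:   (K, D) array, one embedding per token
    features: (C, H, W) array, image feature map
    w_proj:   (D, C) array, assumed linear map from token space to kernel space
    returns:  (K, H, W) array of per-token mask probabilities
    """
    # Each token becomes a 1x1 convolution kernel over the C feature channels.
    kernels = tokens @ w_proj                                # (K, C)
    # Applying a 1x1 kernel is a per-pixel dot product with the features.
    logits = np.einsum("kc,chw->khw", kernels, features)     # (K, H, W)
    # Sigmoid turns logits into per-pixel foreground probabilities.
    return 1.0 / (1.0 + np.exp(-logits))

# Toy inputs: 3 tokens of width 16, an 8-channel 32x32 feature map.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 16))
features = rng.normal(size=(8, 32, 32))
w_proj = rng.normal(size=(16, 8)) * 0.1

masks = dynamic_kernel_masks(tokens, features, w_proj)
print(masks.shape)
```

Because the kernels are computed from the tokens rather than stored as fixed weights, the same prediction head can in principle produce different masks for different inputs; how C2P actually builds and conditions its kernels is not specified in this listing.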
Problem

Research questions and friction points this paper is trying to address.

universal medical image segmentation
visual prompts
domain shift
medical image analysis
cross-modality generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt-free segmentation
geometric-semantic disentanglement
multimodal large language models
dynamic kernel generation
geometry-aware inference consensus
Haoyun Chen
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui 230026, China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, Suzhou, Jiangsu 215123, China; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology, Suzhou, Jiangsu 215123, China
Fenghe Tang
University of Science and Technology of China
Medical Image Analysis, Foundation model
Wenxin Ma
University of Science and Technology of China
AI, computer vision
Shaohua Kevin Zhou
Professor, USTC, FAIMBE, FIAMBE, FIEEE, FMICCAI, FNAI
Medical Image Computing, Computer Vision & Pattern Recognition, Machine & Deep Learning