Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation

📅 2026-03-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current universal medical image segmentation methods rely on manual visual prompts or retrieved reference images, and naive joint training across modalities does not cope with large domain shifts; both limit automation and robustness. This work proposes Concept-to-Pixel (C2P), a prompt-free framework for universal medical image segmentation. C2P separates anatomical knowledge into geometric and semantic representations: it distills high-level medical concepts from a multimodal large language model into learnable Semantic Tokens, and it introduces explicitly supervised Geometric Tokens that encode physical and structural constraints. These tokens interact with image features to generate input-specific dynamic kernels for mask prediction, and a Geometry-Aware Inference Consensus mechanism uses the predicted geometric constraints to assess prediction reliability and suppress outliers. On a unified benchmark of eight datasets spanning seven imaging modalities, the jointly trained C2P model outperforms both single-model and universal baselines and generalizes to zero-shot tasks and to cross-modal transfer across similar tasks.
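The dynamic-kernel idea mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `dynamic_kernel_masks`, the linear projection `w_proj`, and all tensor sizes are assumptions chosen for illustration. Each token is projected into a 1x1 convolution kernel, which is then applied to the pixel features to produce one mask per token.

```python
import numpy as np

def dynamic_kernel_masks(tokens, pixel_feats, w_proj):
    """Turn tokens into per-input 1x1 kernels and apply them to pixel features.

    tokens:      (K, D) token embeddings (hypothetical shape)
    pixel_feats: (C, H, W) decoder feature map (hypothetical shape)
    w_proj:      (D, C) projection from token space to kernel space (assumed linear)
    """
    kernels = tokens @ w_proj                                  # (K, C): one kernel per token
    logits = np.einsum("kc,chw->khw", kernels, pixel_feats)    # 1x1 convolution per token
    return 1.0 / (1.0 + np.exp(-logits))                       # sigmoid -> mask probabilities

# Random stand-ins for real tokens and features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 16))
pixel_feats = rng.normal(size=(8, 32, 32))
w_proj = rng.normal(size=(16, 8)) * 0.1

masks = dynamic_kernel_masks(tokens, pixel_feats, w_proj)
print(masks.shape)
```

Because the kernels are computed from the tokens at inference time, the same network weights yield a different mask predictor for each input, which is the sense in which the kernels are "dynamic".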

📝 Abstract

Universal medical image segmentation seeks to use a single foundation model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which uses the model's predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities show that our jointly trained approach significantly outperforms both universal and single-model approaches. Our unified model also generalizes well, achieving strong results not only on zero-shot tasks involving unseen cases but also in cross-modal transfer across similar tasks. Code is available at: https://github.com/Yundi218/Concept-to-Pixel
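The abstract does not spell out how the Geometry-Aware Inference Consensus works, so the following is only one possible reading, sketched under stated assumptions: several candidate masks are available for the same input, the model's predicted geometry is summarized as an area fraction plus a normalized centroid, and candidates whose geometry deviates by more than a tolerance `tol` are discarded before averaging. The names `mask_geometry`, `geometry_consensus`, and `tol` are hypothetical.

```python
import numpy as np

def mask_geometry(mask):
    """Return (area fraction, centroid row, centroid col), each scaled to [0, 1]."""
    h, w = mask.shape
    total = mask.sum()
    if total == 0:
        return np.array([0.0, 0.5, 0.5])   # empty mask: neutral centroid
    rows, cols = np.indices(mask.shape)
    return np.array([
        total / (h * w),
        (rows * mask).sum() / total / (h - 1),
        (cols * mask).sum() / total / (w - 1),
    ])

def geometry_consensus(masks, predicted_geom, tol=0.15):
    """Average only the candidates whose geometry agrees with predicted_geom."""
    errors = np.array([np.abs(mask_geometry(m) - predicted_geom).max() for m in masks])
    keep = errors <= tol
    if not keep.any():                      # fall back to the single closest candidate
        keep[errors.argmin()] = True
    return masks[keep].mean(axis=0), keep

# Three toy candidates: two plausible squares and one outlier in the corner.
good = np.zeros((16, 16)); good[4:12, 4:12] = 1.0
shifted = np.zeros((16, 16)); shifted[5:13, 4:12] = 1.0
outlier = np.zeros((16, 16)); outlier[0:3, 0:3] = 1.0
candidates = np.stack([good, shifted, outlier])

predicted_geom = np.array([0.25, 0.5, 0.5])  # assumed output of a geometry head
fused, keep = geometry_consensus(candidates, predicted_geom)
print(keep)
```

In this toy case the corner candidate is rejected because both its area and its centroid disagree with the predicted geometry, and the fused mask is the mean of the two remaining candidates.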

Problem

Research questions and friction points this paper is trying to address.

universal medical image segmentation

visual prompts

domain shift

medical image analysis

cross-modality generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt-free segmentation

geometric-semantic disentanglement

multimodal large language models