Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of paired multimodal data and the difficulty of incorporating domain knowledge—particularly clinical text reports—into medical image segmentation, this paper proposes a novel vision-language joint segmentation paradigm that requires no paired image-text data. Methodologically, it (1) freezes a large language model (e.g., LLaMA) to zero-shot generate diagnosis-oriented textual instructions aligned with clinical reasoning, emulating radiologists’ image interpretation and reporting workflow; and (2) introduces a lightweight multimodal feature alignment module and an instruction-conditioned decoder to enable dynamic coupling between visual features and textual instructions. The framework is compatible with both UNet- and Transformer-based backbones. In zero-shot cross-modal (MRI/CT) and cross-organ segmentation tasks, it surpasses state-of-the-art methods by 3.2–5.8% in Dice coefficient. Qualitative visualization confirms its clinical semantic consistency and anatomical plausibility.
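The summary describes a lightweight alignment module that dynamically couples visual features with LLM-generated textual instructions. The paper does not give the exact mechanism, but a common way to realize such coupling is cross-attention, where visual tokens query the instruction tokens. Below is a minimal NumPy sketch under that assumption; the projection matrices stand in for learned parameters, and the one-channel decoder head is purely hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instruction_conditioned_fusion(vis, txt, rng):
    """Cross-attention: visual tokens (queries) attend to instruction
    tokens (keys/values).
      vis: (N_pix, d) flattened visual feature map
      txt: (N_tok, d) LLM-generated instruction embeddings
    Random projections stand in for learned weights (illustration only)."""
    d = vis.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = vis @ Wq, txt @ Wk, txt @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))   # (N_pix, N_tok) attention map
    return vis + attn @ v                  # residual coupling of modalities

rng = np.random.default_rng(0)
vis = rng.standard_normal((16 * 16, 32))   # 16x16 feature map, 32 channels
txt = rng.standard_normal((8, 32))         # 8 instruction tokens
fused = instruction_conditioned_fusion(vis, txt, rng)
logits = fused @ rng.standard_normal((32, 1))  # hypothetical 1-class head
mask = logits.reshape(16, 16) > 0              # binary segmentation mask
```

In a real instruction-conditioned decoder the fused features would pass through further upsampling layers of the UNet- or Transformer-based backbone; this sketch only shows where the text conditioning enters.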

📝 Abstract
Medical image segmentation has achieved remarkable success through the continuous advancement of UNet-based and Transformer-based foundation backbones. However, clinical diagnosis in the real world often requires integrating domain knowledge, especially textual information. Multimodal learning over visual and text modalities has been shown to be a solution, but collecting paired vision-language datasets is expensive and time-consuming, posing significant challenges. Inspired by the superior ability of Large Language Models (LLMs) in numerous cross-modal tasks, we propose a novel Vision-LLM union framework to address these issues. Specifically, we introduce frozen LLMs for zero-shot instruction generation based on the corresponding medical images, imitating the radiology scanning and report-generation process. To better approximate real-world diagnostic processes, we generate more precise text instructions from multimodal radiology images (e.g., T1-w or T2-w MRI and CT), building on the impressive semantic understanding and rich knowledge of LLMs. This process emphasizes extracting distinctive features from each modality and reuniting the information for the final clinical diagnosis. With the generated text instructions, our proposed union segmentation framework can handle multimodal segmentation without previously collected vision-language datasets. To evaluate our proposed method, we conduct comprehensive experiments against influential baselines; the statistical results and the visualized case study demonstrate the superiority of our novel method.
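The reported gains are measured with the Dice coefficient, the standard overlap metric for segmentation. For reference, a minimal implementation on binary masks (the `eps` smoothing term is a common convention, not specified by the paper):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2 * |P intersect T| / (|P| + |T|) on binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_coefficient(pred, target), 3))  # 2*2/(3+3) ≈ 0.667
```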
Problem

Research questions and friction points this paper is trying to address.

Integrate domain knowledge with medical image segmentation
Reduce reliance on paired vision-language datasets
Improve multimodal segmentation using zero-shot LLM instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot LLM instruction for medical segmentation
Frozen LLMs generate text from medical images
Multimodal segmentation without paired datasets