🤖 AI Summary
Leveraging the semantic reasoning capabilities of large language models (LLMs) for medical image segmentation without incurring substantial trainable-parameter overhead remains challenging. Method: We propose LLM4Seg, a novel framework that integrates frozen, pre-trained LLM layers (e.g., from LLaMA or DeepSeek) into a CNN encoder-decoder architecture, enabling them to process visual tokens directly and endowing the model with strong semantic awareness. Semantic enhancement of the visual tokens is achieved via lightweight fine-tuning that introduces only a minimal number of trainable parameters. Contribution/Results: Evaluated across multiple medical imaging modalities, including ultrasound, dermoscopy, colonoscopy, and CT, LLM4Seg consistently improves segmentation performance, enhancing both global contextual modeling and local detail fidelity. Crucially, it transfers the LLM's semantic understanding to purely visual segmentation tasks *without* requiring vision-language alignment pretraining. This establishes a new low-parameter paradigm for medical image segmentation across modalities.
📝 Abstract
With the advancement of Large Language Models (LLMs) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within a CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, colonoscopy, and CT scans. Our in-depth analysis reveals the potential of transferring the LLM's semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek.
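The hybrid structure described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact architecture: the channel/token sizes are made up, and an untrained `nn.TransformerEncoderLayer` stands in for the frozen pre-trained LLM layer (in LLM4Seg this would be an actual LLaMA or DeepSeek block). The key ideas shown are (1) flattening CNN features into visual tokens, (2) passing them through a frozen transformer layer wrapped by lightweight trainable projections, and (3) decoding back to a segmentation map.

```python
import torch
import torch.nn as nn

class LLM4SegSketch(nn.Module):
    """Illustrative sketch of the LLM4Seg idea: a frozen transformer layer
    (stand-in for a pre-trained LLM layer) between a CNN encoder and decoder,
    with only small projection layers added as trainable parameters."""

    def __init__(self, in_ch=1, feat_ch=32, llm_dim=64):
        super().__init__()
        # CNN encoder: two stride-2 convolutions downsample the image 4x.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Lightweight trainable projections into/out of the "LLM" width.
        self.proj_in = nn.Linear(feat_ch, llm_dim)
        self.proj_out = nn.Linear(llm_dim, feat_ch)
        # Stand-in for one frozen pre-trained LLM layer (assumption: in the
        # real model this is a LLaMA/DeepSeek block with frozen weights).
        self.llm_layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=4, batch_first=True)
        for p in self.llm_layer.parameters():
            p.requires_grad = False  # the LLM layer is never trained
        # CNN decoder: upsample back to a per-pixel segmentation logit.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(feat_ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        f = self.encoder(x)                    # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/16, C) visual tokens
        tokens = self.proj_out(self.llm_layer(self.proj_in(tokens)))
        # Residual connection: semantically enhanced tokens refine features.
        f = tokens.transpose(1, 2).reshape(b, c, h, w) + f
        return self.decoder(f)                 # (B, 1, H, W) logits

model = LLM4SegSketch()
out = model(torch.randn(2, 1, 64, 64))  # two 64x64 single-channel images
```

Because the transformer layer is frozen, the only new trainable parameters beyond the CNN backbone are the two small `Linear` projections, which is what keeps the trainable-parameter overhead minimal in this design.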