STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical image segmentation often suffers from reduced accuracy due to highly variable lesion scales and uncertain spatial distributions. To address this, we propose a vision-language collaborative network integrated with multi-scale textual prompts and introduce the first retrieval-segmentation joint learning paradigm. During training, semantically relevant prompts are dynamically retrieved from a self-constructed medical text corpus to enhance visual feature representation; during inference, no textual input is required, enabling seamless cross-modal knowledge transfer and lightweight deployment. The method comprises multi-scale textual encoding, vision-language feature alignment, and end-to-end retrieval-augmented joint training. Extensive experiments on COVID-Xray, COVID-CT, and Kvasir-SEG demonstrate significant improvements over state-of-the-art methods, particularly in segmenting small-sized and irregularly shaped lesions.
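The core training/inference asymmetry described above can be illustrated with a minimal sketch: during training, the prompt most similar to the current visual feature is retrieved from a pre-embedded text corpus and fused in; at inference, the fusion step simply passes the visual feature through, so no text is needed. All names, shapes, and the additive fusion below are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus of pre-embedded text prompts
# (sizes and embeddings are illustrative, not from the paper).
text_corpus = rng.normal(size=(8, 16))  # 8 prompts, 16-dim embeddings

def retrieve_prompt(visual_feat, corpus):
    """Return the corpus embedding with highest cosine similarity to the visual feature."""
    v = visual_feat / np.linalg.norm(visual_feat)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus[np.argmax(c @ v)]

def fuse(visual_feat, text_feat=None):
    """Inject text guidance during training; pass visuals through at inference."""
    if text_feat is None:            # inference: no textual input required
        return visual_feat
    return visual_feat + text_feat   # training: simple additive fusion (a stand-in)

# Training step: retrieval-augmented feature
vis = rng.normal(size=16)
fused_train = fuse(vis, retrieve_prompt(vis, text_corpus))

# Inference step: text-free, same feature dimensionality
fused_infer = fuse(vis)
```

In the actual method, the fusion would be a learned vision-language alignment module trained end-to-end with the segmentation loss; the point of the sketch is only the retrieval-at-training / text-free-at-inference control flow.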

📝 Abstract
Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code has been made publicly available at https://github.com/HUANGLIZI/STPNet.
Problem

Research questions and friction points this paper is trying to address.

Improves lesion segmentation accuracy in medical images
Addresses uncertainty in lesion distribution and size
Bridges semantic gap between visual and linguistic modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language modeling for segmentation
Uses multi-scale text to guide lesion localization
Retrieves medical text for cross-modal learning