PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation

📅 2025-11-26
🤖 AI Summary
In medical image segmentation, text-prompted foundation models suffer from low spatial accuracy and poor generalizability, while visual prompting relies on costly manual annotations. To address this, we propose a progressive prompt-enhancement framework that, for the first time, reliably converts weak textual descriptions into high-confidence bounding boxes in a zero-shot setting. Our method leverages vision-language models to generate initial pseudo-bounding boxes, then refines them via uncertainty-aware filtering and adaptive expansion. Crucially, the framework decouples prompt generation from segmentation, ensuring compatibility with diverse state-of-the-art segmentation backbones. Evaluated across three cross-modality datasets, it achieves consistent improvements: Dice score gains of 3.2–5.8% and reductions in average surface distance of 12.4–19.7%, significantly outperforming both text- and visual-prompting baselines and even surpassing some fully supervised methods. This work establishes a novel paradigm for zero-shot, anatomically precise segmentation.

📝 Abstract
Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural-language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging the spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, such precise visual prompts are costly and challenging to obtain in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter out unreliable predictions. The retained image-bbox pairs are then used to train a detector on these pseudo-labels, producing high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by expanding them to tightly cover the target anatomical structures. The enhanced, spatially grounded bbox prompts guide existing segmentation models to generate the final dense masks, effectively amplifying weak text cues into strong spatial guidance. Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using any labeled data. PPBoost also generalizes across multiple typical visual segmentation backbones.
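The abstract describes an uncertainty-aware criterion for discarding unreliable VLM-generated pseudo-bboxes, but the paper's exact criterion is not given here. A minimal sketch, assuming a simple confidence-score threshold (the `score_threshold` value and the `(bbox, score)` pair format are illustrative assumptions, not from the paper):

```python
def filter_pseudo_bboxes(candidates, score_threshold=0.5):
    """Keep only pseudo-bboxes whose confidence clears the threshold.

    `candidates` is a list of (bbox, score) pairs, where bbox is
    (x_min, y_min, x_max, y_max) in pixel coordinates and score is
    the VLM's confidence in [0, 1]. This thresholding stands in for
    the paper's (unspecified) uncertainty-aware criterion.
    """
    return [(bbox, score) for bbox, score in candidates
            if score >= score_threshold]


# Example: only the high-confidence candidate survives filtering.
kept = filter_pseudo_bboxes([((12, 8, 64, 70), 0.91),
                             ((3, 3, 20, 20), 0.18)])
```

The retained image-bbox pairs would then serve as pseudo-labels for training the detector described in the abstract.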
Problem

Research questions and friction points this paper is trying to address.

Improving spatial precision of text-prompted medical image segmentation
Generating visual prompts without costly manual bounding-box annotations
Enhancing segmentation accuracy under domain shift without labeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms text prompts into visual bounding boxes
Uses uncertainty filtering and pseudo-label training
Refines bounding boxes to guide segmentation models
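The inference-time refinement step expands detector-produced bboxes so they tightly cover the target structure before being passed to the segmentation backbone. The paper's adaptive expansion rule is not specified here; a minimal sketch assuming a fixed relative margin clipped to image bounds (`margin_ratio` is a hypothetical parameter):

```python
def expand_bbox(bbox, image_size, margin_ratio=0.1):
    """Expand a bbox by a relative margin, clipped to the image.

    `bbox` is (x_min, y_min, x_max, y_max); `image_size` is (W, H).
    A 10% margin per side is an illustrative default, not the
    paper's adaptive rule.
    """
    x0, y0, x1, y1 = bbox
    dx = (x1 - x0) * margin_ratio   # horizontal margin
    dy = (y1 - y0) * margin_ratio   # vertical margin
    W, H = image_size
    return (max(0.0, x0 - dx), max(0.0, y0 - dy),
            min(float(W), x1 + dx), min(float(H), y1 + dy))
```

The expanded box then serves as the visual prompt for any bbox-promptable segmentation model, which is what keeps prompt generation decoupled from the segmentation backbone.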