🤖 AI Summary
To address fine-grained benign/malignant tumor classification, precise lesion localization, and poor cross-device generalization in ultrasound imaging, this paper proposes UltraAD, a few-shot anomaly detection method based on a vision-language model. The method builds on the CLIP framework and adapts to the task using only a small number of annotated samples. Its key contributions are: (1) an image-guided prompt fusion step that fuses the image-level token with learnable text embeddings; (2) a memory bank built from few-shot image samples and text descriptions of anatomical and abnormality-specific features, whose text embeddings stay frozen during training while image features are adapted to medical data; and (3) patch-level feature refinement using the image-informed prompt to strengthen local discrimination. Evaluated on three breast ultrasound datasets, the method outperforms state-of-the-art approaches in both lesion localization and benign/malignant classification, and it is designed to mitigate the domain shift caused by heterogeneous ultrasound equipment. The approach targets computer-assisted diagnosis in clinical ultrasound.
📝 Abstract
Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they cannot make fine-grained distinctions, such as benign versus malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to variations in devices and acquisition parameters, which creates significant domain gaps between the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, where it outperforms state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.
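
The abstract does not give the exact fusion or scoring formulas, so the following is only a minimal NumPy sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes additive fusion of the image-level token into the text embeddings, cosine-similarity scoring of patch tokens against the fused prompts, and a memory-bank score that sums image-feature and frozen text-embedding similarity. All function names and the parameters `alpha` and `temperature` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_prompts(text_emb, image_token, alpha=0.5):
    """Assumed fusion rule: add a scaled image-level token to each text embedding.

    text_emb:    (2, D) learnable embeddings, row 0 = normal, row 1 = abnormal.
    image_token: (D,) image-level token of the query image.
    """
    return l2_normalize(text_emb + alpha * image_token[None, :])


def anomaly_map(patch_tokens, prompts, temperature=0.07):
    """Per-patch probability of the abnormal prompt.

    patch_tokens: (H, W, D) patch-level tokens.
    prompts:      (2, D) image-informed prompts from fuse_prompts.
    """
    logits = l2_normalize(patch_tokens) @ prompts.T / temperature  # (H, W, 2)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs[..., 1]


def classify_with_memory(query_feat, bank_feats, bank_labels, bank_text):
    """Assumed scoring rule for fine-grained classification.

    For each class, average the cosine similarity between the query and the
    stored few-shot image features, then add the similarity to that class's
    frozen text embedding. Returns (predicted class index, per-class scores).

    query_feat:  (D,) image feature of the query.
    bank_feats:  (N, D) few-shot image features.
    bank_labels: (N,) integer class label per stored feature.
    bank_text:   (C, D) frozen text embeddings, one per class.
    """
    q = l2_normalize(query_feat)
    image_sims = l2_normalize(bank_feats) @ q  # (N,)
    text_sims = l2_normalize(bank_text) @ q    # (C,)
    scores = np.array([
        image_sims[bank_labels == c].mean() + text_sims[c]
        for c in range(bank_text.shape[0])
    ])
    return int(scores.argmax()), scores
```

In a trainable version, `text_emb` and the image-side adapter would receive gradients while `bank_text` stays fixed, matching the abstract's statement that stored text embeddings remain frozen during training.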