🤖 AI Summary
Accurate textual localization of positive lesions in PET/CT volumes faces challenges including weak 3D image–report alignment, the scarcity of pixel- or voxel-level annotations, and difficulty localizing small lesions or regions with low radiotracer uptake. Method: We propose ConTEXTual Net 3D, the first large-scale 3D vision–language localization model for PET/CT, featuring (i) an automated weak-annotation pipeline, driven by SUV<sub>max</sub> values and axial slice indices, that generated 11,356 weakly aligned sentence–volume pairs, and (ii) a token-level cross-modal attention mechanism that tightly fuses LLM-derived text embeddings with 3D nnU-Net volumetric features. Contribution/Results: Our model achieves an F1 score of 0.80, substantially outperforming a 2.5D baseline (0.53) and LLMSeg (0.22), and generalizes across diverse tracers (e.g., FDG, DOTATATE), establishing a new state of the art in weakly supervised 3D lesion localization.
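The token-level cross-modal fusion described above can be illustrated with a minimal sketch. This is a hedged, dependency-free toy (plain Python, tiny dimensions): each voxel feature acts as a query over the report's token embeddings, and the attended text context is fused back into the visual feature by a residual add. The function names, the residual fusion choice, and the unprojected single-head form are illustrative assumptions, not the paper's actual architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention_fuse(voxel_feats, token_embs):
    """Token-level cross-attention: every voxel feature (query) attends
    over the text token embeddings (keys/values), and the resulting text
    context is added to the voxel feature (residual fusion).

    voxel_feats: list of d-dim vectors, one per voxel
    token_embs:  list of d-dim vectors, one per report token
    """
    d = len(token_embs[0])
    fused = []
    for v in voxel_feats:
        scores = [dot(v, t) / math.sqrt(d) for t in token_embs]
        weights = softmax(scores)
        context = [sum(w * t[i] for w, t in zip(weights, token_embs))
                   for i in range(d)]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused
```

In a real model the queries, keys, and values would pass through learned projections and the fusion would sit inside the decoder, but the attend-then-fuse pattern is the same.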
📝 Abstract
Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak labels linking PET/CT report descriptions to their image locations and used it to train a 3D vision-language visual grounding model. Our pipeline finds positive findings in PET/CT reports by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which integrates text embeddings from a large language model with a 3D nnU-Net via token-level cross-attention. The model's performance was compared against LLMSeg, a 2.5D version of ConTEXTual Net, and two nuclear medicine physicians. The weak-labeling pipeline accurately identified lesion locations in 98% of cases (246/251), with 7.5% requiring boundary adjustments. ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1=0.22) and the 2.5D model (F1=0.53), though it underperformed both physicians (F1=0.94 and 0.91). The model achieved better performance on FDG (F1=0.78) and DCFPyL (F1=0.75) exams, while performance dropped on DOTATATE (F1=0.58) and fluciclovine (F1=0.66). The model performed consistently across lesion sizes but showed reduced accuracy on lesions with low uptake. Our novel weak-labeling pipeline accurately produced an annotated dataset of PET/CT image-text pairs, facilitating the development of 3D visual grounding models. ConTEXTual Net 3D significantly outperformed other models but fell short of the performance of nuclear medicine physicians. Our study suggests that even larger datasets may be needed to close this performance gap.
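The report-mining step of the weak-labeling pipeline (finding sentences that mention both an SUVmax value and an axial slice number) could be sketched roughly as follows. This is a simplified illustration: the regex patterns and the `extract_findings` helper are hypothetical, and the actual pipeline is not specified in code in the abstract.

```python
import re

# Hypothetical patterns for illustration; a production pipeline would
# handle many more phrasings, units, and negations.
SUV_RE = re.compile(r"SUV\s*max\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
SLICE_RE = re.compile(r"(?:axial\s+)?(?:slice|image)\s*(?:#|number)?\s*(\d+)",
                      re.IGNORECASE)

def extract_findings(report_text):
    """Scan report sentences for positive findings mentioning both an
    SUVmax value and an axial slice number, yielding weak sentence-level
    labels as (sentence, suv_max, slice_index) tuples."""
    findings = []
    for sentence in re.split(r"(?<=[.!?])\s+", report_text):
        suv = SUV_RE.search(sentence)
        slc = SLICE_RE.search(sentence)
        if suv and slc:
            findings.append((sentence.strip(),
                             float(suv.group(1)),
                             int(slc.group(1))))
    return findings
```

Each extracted slice index can then be mapped to a z-position in the PET volume, where the voxel whose uptake matches the reported SUVmax anchors the weak spatial label.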