DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM)-driven approaches for CT report generation typically encode 3D images holistically, struggling to distinguish diagnostically relevant lesions from redundant anatomical background. Inspired by radiologists' "cognitive subtraction," this work proposes a differential visual-semantic prompting mechanism that introduces, for the first time, a learnable structured visual prefix. By explicitly modeling multi-granular semantic differences between scan and reference images, the method amplifies diagnostic evidence and suppresses invariant structures without requiring explicit lesion localization. The framework integrates a hierarchical difference extractor with a difference-aware prompt generator to guide the LLM toward more accurate report generation. Evaluated on two large-scale CT datasets, the approach substantially outperforms current methods, improving the average of BLEU-1 through BLEU-4 by 10.98 and 4.36 points on the two datasets, respectively, and attaining an F1 score of 0.421 on RadGenome-ChestCT.

📝 Abstract
While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically, failing to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor to capture complementary global and local semantic discrepancies in a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby enabling accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average of BLEU-1 through BLEU-4 by +10.98 and +4.36 points, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). All code will be released at https://github.com/ArielTYH/DiffVP/.
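The abstract's two-module pipeline (a hierarchical difference extractor feeding a difference-to-prompt generator) can be sketched in minimal numpy form. This is an illustrative reconstruction, not the paper's implementation: the dimensions, the mean-pooled global difference, the softmax pooling, and the linear projection `W_prompt` are all assumptions standing in for the learned components described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): feature dim D, number of
# local patch features P, number of prefix tokens K, LLM embedding dim E.
D, P, K, E = 64, 8, 4, 128

def hierarchical_difference(scan_feats, ref_feats):
    """Sketch of the hierarchical difference extractor: one global
    scan-to-reference discrepancy plus per-patch local discrepancies,
    stacked into a single set of difference signals."""
    global_diff = scan_feats.mean(axis=0) - ref_feats.mean(axis=0)  # (D,)
    local_diff = scan_feats - ref_feats                             # (P, D)
    return np.vstack([global_diff[None, :], local_diff])            # (P+1, D)

# Stand-ins for the learned difference-to-prompt generator parameters.
W_prompt = rng.standard_normal((D, E)) * 0.02
prompt_queries = rng.standard_normal((K, P + 1))  # attention-style pooling

def difference_to_prompt(diffs):
    """Map difference signals to K visual prefix tokens in the LLM's
    embedding space: softmax pooling over the P+1 difference signals,
    followed by a linear projection."""
    attn = np.exp(prompt_queries)
    attn /= attn.sum(axis=1, keepdims=True)  # rows sum to 1
    pooled = attn @ diffs                    # (K, D)
    return pooled @ W_prompt                 # (K, E) prefix tokens

scan = rng.standard_normal((P, D))
ref = rng.standard_normal((P, D))
prefix = difference_to_prompt(hierarchical_difference(scan, ref))
print(prefix.shape)  # (4, 128)
```

Note how the "cognitive subtraction" intuition falls out of the design: if the scan matches the reference exactly, every difference signal is zero and the prefix tokens vanish, so only deviations from expected anatomy condition the LLM.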
Problem

Research questions and friction points this paper is trying to address.

CT report generation
visual prompting
semantic differences
large language models
anatomical background
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential Visual Prompting
LLM-based report generation
semantic difference extraction
visual prefix tokens
cognitive subtraction
Yuhe Tian
Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China (USTC), Hefei, Anhui 230026, China
Kun Zhang
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui 230026, China
Haoran Ma
PhD Student, University of California, Los Angeles
Computer Systems · Software Engineering
Rui Yan
Zhejiang University of Technology
Deep Neural Networks
Yingtai Li
University of Science & Technology of China
Rongsheng Wang
The Chinese University of Hong Kong, Shenzhen
Deep Learning
Shaohua Kevin Zhou
Professor, USTC, FAIMBE, FIAMBE, FIEEE, FMICCAI, FNAI
Medical Image Computing · Computer Vision & Pattern Recognition · Machine & Deep Learning