An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the main challenges in diagnosing lumbar spinal stenosis—high annotation costs, substantial inter-observer variability, and extreme class imbalance—by proposing an end-to-end interpretable vision-language model. The method introduces a novel Spatial Patch Cross-Attention mechanism that preserves the anatomical hierarchy of the image, enabling precise lesion localization guided by textual prompts. It further integrates control theory into loss function design through an adaptive PID-Tversky loss that dynamically steers optimization toward rare and difficult-to-segment cases. The model also generates clinical reports and achieves state-of-the-art performance, outperforming existing approaches in classification accuracy (90.69%), macro-averaged segmentation Dice score (0.9512), and CIDEr score (92.80), offering both high precision and strong interpretability.
📝 Abstract
Lumbar Spinal Stenosis (LSS) remains a critical clinical challenge: diagnosis depends heavily on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while preserving spatial accuracy, primarily because their global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations through two principal contributions. First, we propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies. Second, a novel Adaptive PID-Tversky Loss function, which integrates control theory principles, dynamically modulates training penalties to target difficult, under-segmented minority instances. By coupling foundational VLMs with an Automated Radiology Report Generation module, our framework achieves strong performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80. Furthermore, the framework provides explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that preserves essential human supervision while enhancing diagnostic capability.
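The paper does not publish the loss formulation here, but the idea described—a PID controller from control theory modulating the Tversky loss weights to penalize under-segmentation of minority classes—can be illustrated with a minimal sketch. All gains (`kp`, `ki`, `kd`), the controlled variable (the false-negative weight `beta`), and the error signal (per-epoch false-negative rate) are assumptions for illustration, not the authors' actual design.

```python
import numpy as np

def tversky_loss(pred, target, alpha, beta, eps=1e-7):
    """Tversky loss: generalizes Dice by weighting false positives (alpha)
    and false negatives (beta) separately; alpha = beta = 0.5 recovers Dice."""
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1 - target))
    fn = np.sum((1 - pred) * target)
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tversky

class PIDTverskyAdapter:
    """Hypothetical PID controller that nudges beta (the false-negative
    weight) upward when the observed false-negative rate stays high,
    pushing training to recover under-segmented minority structures."""
    def __init__(self, kp=0.5, ki=0.05, kd=0.1, beta0=0.5):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta0
        self.integral = 0.0
        self.prev_error = None

    def update(self, fn_rate):
        # Error signal: observed false-negative rate (setpoint is 0).
        error = fn_rate
        self.integral += error
        deriv = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        # Standard PID correction, clipped to keep beta in a sane range.
        self.beta = float(np.clip(
            self.beta + self.kp * error + self.ki * self.integral + self.kd * deriv,
            0.1, 0.9))
        return self.beta
```

In a training loop, `update` would run once per epoch on the validation false-negative rate, and the returned `beta` (with `alpha = 1 - beta`, say) would parameterize the next epoch's `tversky_loss`.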
Problem

Research questions and friction points this paper is trying to address.

Lumbar Spinal Stenosis
Vision-Language Model
Class Imbalance
Spatial Accuracy
MRI Interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable Vision-Language Model
Adaptive PID-Tversky Loss
Spatial Patch Cross-Attention
Class Imbalance
Automated Radiology Report Generation
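The Spatial Patch Cross-Attention idea listed above—letting text tokens attend over image patch embeddings without the global pooling that discards anatomical layout—can be sketched as follows. The function name, shapes, and single-head formulation are illustrative assumptions; the paper's actual module is not reproduced here.

```python
import numpy as np

def spatial_patch_cross_attention(patches, text, wq, wk, wv):
    """Illustrative single-head cross-attention: text tokens query image
    patch embeddings, keeping the per-patch grid (no global pooling) so
    the attention weights double as a spatial localization map.

    patches: (P, d) image patch embeddings
    text:    (T, d) text token embeddings
    wq, wk, wv: (d, d) projection matrices
    """
    q = text @ wq                              # (T, d) queries from text
    k = patches @ wk                           # (P, d) keys from patches
    v = patches @ wv                           # (P, d) values from patches
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (T, P) text-to-patch affinity
    # Numerically stable softmax over the patch axis.
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    fused = attn @ v                           # (T, d) text-conditioned features
    return fused, attn                         # attn is the per-patch map
```

Because the softmax runs over patches rather than a pooled vector, each row of `attn` can be reshaped back to the patch grid to visualize which anatomical region a textual prompt attends to.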