Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical vision-language models face four key bottlenecks in Alzheimer’s disease (AD) diagnosis: underutilization of patient metadata, lack of clinical knowledge integration, high computational overhead, and weak 3D structural modeling for volumetric neuroimaging. To address these, we propose a lightweight cross-modal prompt-tuning framework. Our method innovatively converts structured patient metadata into synthetic clinical reports to enrich the textual modality and incorporates Mini-Mental State Examination (MMSE) scores as auxiliary token-prediction targets to provide clinically grounded supervision. Furthermore, the framework jointly processes 3D CT/MRI volumes and textual inputs to enable efficient multimodal alignment and diagnostic reasoning. Evaluated on two AD benchmark datasets, our approach achieves superior performance using only 1,500 training images—outperforming state-of-the-art fine-tuned models trained on 10,000 images—thereby significantly improving data efficiency and clinical deployability.
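The metadata-to-report conversion described above can be pictured as a simple templating step that renders tabular patient fields as free text for the model's text encoder. The sketch below is illustrative only: the field names (`age`, `sex`, `education_years`, `mmse`) are assumptions, not the paper's actual schema or wording.

```python
def metadata_to_report(meta: dict) -> str:
    """Render structured patient metadata as a synthetic clinical report.

    Hypothetical field names; the paper's actual template and schema
    are not specified in this summary.
    """
    parts = [
        f"Patient is a {meta['age']}-year-old {meta['sex']}",
        f"with {meta['education_years']} years of education.",
    ]
    if "mmse" in meta:
        # MMSE is scored out of 30; lower scores indicate greater impairment.
        parts.append(f"MMSE score is {meta['mmse']}/30.")
    return " ".join(parts)


report = metadata_to_report(
    {"age": 72, "sex": "female", "education_years": 16, "mmse": 24}
)
```

The resulting string can then be paired with the 3D volume for image-text alignment in place of (or alongside) a radiology report.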

📝 Abstract
Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited by the absence of 3D structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs to 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching the textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the Mini-Mental State Examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity; this provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both the image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using only 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.
Problem

Research questions and friction points this paper is trying to address.

Underutilization of patient metadata in Med-VLMs
Lack of clinical diagnostic knowledge integration
Inefficient 3D medical imaging adaptation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts metadata into synthetic reports for alignment
Adds auxiliary token predicting MMSE clinical score
Uses lightweight prompt tuning for efficient fine-tuning
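The last two bullets can be sketched together: a frozen backbone over which only a small set of soft prompt embeddings, a dedicated MMSE token, and two task heads are trained. This is a minimal PyTorch illustration, not the paper's implementation: the backbone here is a stand-in `nn.TransformerEncoder` rather than a Med-VLM, and all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn


class PromptTunedClassifier(nn.Module):
    """Illustrative prompt tuning with an auxiliary MMSE regression token."""

    def __init__(self, dim=64, n_prompts=8, n_classes=3):
        super().__init__()
        # Stand-in frozen backbone; only prompts, the MMSE token,
        # and the two heads receive gradients.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Learnable soft prompts prepended to the input sequence.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # Dedicated token whose output state predicts the MMSE score.
        self.mmse_token = nn.Parameter(torch.randn(1, dim) * 0.02)
        self.cls_head = nn.Linear(dim, n_classes)   # diagnosis head
        self.mmse_head = nn.Linear(dim, 1)          # auxiliary MMSE head

    def forward(self, tokens):
        # tokens: (B, L, dim) fused image/text features from the frozen model.
        b = tokens.size(0)
        x = torch.cat(
            [
                self.mmse_token.expand(b, -1, -1),
                self.prompts.expand(b, -1, -1),
                tokens,
            ],
            dim=1,
        )
        x = self.backbone(x)
        # Position 0 holds the MMSE token; mean-pool the rest for diagnosis.
        logits = self.cls_head(x[:, 1:].mean(dim=1))
        mmse_pred = self.mmse_head(x[:, 0]).squeeze(-1)
        return logits, mmse_pred


model = PromptTunedClassifier()
logits, mmse_pred = model(torch.randn(2, 16, 64))
```

Training would combine a classification loss on `logits` with a regression loss on `mmse_pred`, so the MMSE score acts as clinically grounded auxiliary supervision while the backbone stays frozen.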