RadVLM: A Multitask Conversational Vision-Language Model for Radiology

📅 2025-02-05
🤖 AI Summary
Current vision-language models (VLMs) lack interactive diagnostic capabilities for chest X-ray (CXR) analysis. To address this gap, we propose the first lightweight, multi-task VLM tailored to radiological clinical needs, supporting report generation, abnormality classification, visual grounding, and multi-turn, multi-task dialogue. We introduce a radiology-specific instruction-tuning paradigm that jointly optimizes single-turn discriminative and generative tasks alongside multi-turn dialogue. To enable this, we construct a large-scale CXR instruction dataset comprising over one million instruction-response pairs, with both single-turn and multi-turn annotations. Experiments show state-of-the-art performance on dialogue understanding and visual grounding, while remaining competitive across other radiological tasks. Ablation studies confirm that joint multi-task training significantly enhances generalization and robustness in few-shot settings.

📝 Abstract
The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
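The abstract describes an instruction dataset mixing single-turn tasks (report generation, abnormality classification, visual grounding) with multi-turn, multi-task dialogues over the same image. A minimal sketch of what such records might look like is shown below; the field names, role labels, and example contents are illustrative assumptions, not RadVLM's actual data schema.

```python
# Hypothetical CXR instruction-tuning records, modeled only on the task types
# named in the abstract. All field names and values are illustrative
# assumptions, not the paper's actual format.

def make_single_turn(image_id: str, task: str, instruction: str, response: str) -> dict:
    """Build one single-turn instruction-response pair for a CXR image."""
    return {
        "image": image_id,
        "task": task,
        "conversation": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ],
    }

def make_multi_turn(image_id: str, turns: list[tuple[str, str]]) -> dict:
    """Build one multi-turn, multi-task dialogue grounded in the same image."""
    conversation = []
    for instruction, response in turns:
        conversation.append({"role": "user", "content": instruction})
        conversation.append({"role": "assistant", "content": response})
    return {"image": image_id, "task": "dialogue", "conversation": conversation}

# Example records (placeholder contents):
single = make_single_turn(
    "cxr_000001.png",
    "report_generation",
    "Write the findings section for this chest X-ray.",
    "Lungs are clear. No pleural effusion or pneumothorax.",
)
multi = make_multi_turn(
    "cxr_000002.png",
    [
        ("Is there cardiomegaly?", "Yes, the cardiac silhouette is enlarged."),
        ("Can you localize it?", "Bounding box [0.32, 0.55, 0.71, 0.93] (normalized)."),
    ],
)
```

Keeping single-turn and multi-turn records in one shared conversation format is what would let a single fine-tuning run jointly optimize all tasks, which is the benefit the ablation studies attribute to joint training.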
Problem

Research questions and friction points this paper is trying to address.

Automated CXR analysis and reporting
Interactive diagnostic capabilities in VLMs
Multitask conversational model for radiology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multitask conversational foundation model
Large-scale instruction dataset curation
State-of-the-art conversational capabilities
Nicolas Deperrois
Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
Hidetoshi Matsuo
Kobe University
Medical Imaging, Deep Learning
Samuel Ruipérez-Campillo
Department of Computer Science, ETH Zurich, Zurich, Switzerland
Moritz Vandenhirtz
PhD student, ETH Zurich
Generative Modeling, Interpretable Machine Learning, Computer Vision, Medical Data Science
Sonia Laguna
PhD student, ETH Zürich
Machine Learning, Generative Models, Interpretability
Alain Ryser
PhD Student, ETH
Computer Science, Medical Data Science, Machine Learning
Koji Fujimoto
Department of Advanced Imaging in Medical Magnetic Resonance, Kyoto University, Kyoto, Japan
Mizuho Nishio
Kyoto University
Medical Image Analysis, Machine Learning, Deep Learning, Radiology, Computer Vision
Thomas M. Sutter
Postdoc, ETH Zurich
Generative Models, Multimodal ML, Probabilistic ML, Representation Learning, ML for Healthcare
Julia E. Vogt
Department of Computer Science, ETH Zurich, Zurich, Switzerland
Jonas Kluckert
Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland; Diagnostic and Interventional Radiology, University Hospital Zurich, Zurich, Switzerland
Thomas Frauenfelder
Diagnostic and Interventional Radiology, University Hospital Zurich, Zurich, Switzerland
Christian Bluthgen
Diagnostic and Interventional Radiology, University Hospital Zurich, Zurich, Switzerland
F. Nooralahzadeh
Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
Michael Krauthammer
Michael Krauthammer
University of Zurich
Biomedical Informatics