🤖 AI Summary
Existing radiology report classification methods suffer from three key limitations: rule-based approaches lack generalizability; supervised models require extensive labeled data; and large language model (LLM)-based solutions are predominantly closed-source, computationally expensive, and restricted to English and single-label, unimodal settings. To address these, we propose MOSAIC, the first open-source, multilingual, taxonomy-agnostic, and lightweight classification framework, built on the compact MedGemma-4B model. Our method combines zero- and few-shot prompting with efficient fine-tuning and introduces domain-specific data augmentation. Evaluated on seven multilingual, multimodal datasets, MOSAIC attains a mean macro-F1 of 88 across five chest X-ray tasks, approaching or exceeding expert-level performance, and with augmentation reaches a weighted F1 of 82 on Danish reports from just 80 annotated samples. The framework is deployable on consumer-grade GPUs (24 GB VRAM), substantially lowering the barrier to using LLMs in clinical practice.
📝 Abstract
Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
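To make the zero-/few-shot setting concrete, below is a minimal sketch of how a taxonomy-agnostic few-shot classification prompt for a chat-style model like MedGemma-4B might be assembled. The label set, example reports, and prompt wording are illustrative assumptions, not MOSAIC's actual prompts or taxonomy.

```python
# Hypothetical few-shot prompt construction for radiology report
# classification. Labels and in-context examples are placeholders;
# a real deployment would plug in the target dataset's own taxonomy.

LABELS = ["normal", "pneumonia", "pleural effusion", "other finding"]

FEW_SHOT_EXAMPLES = [
    ("Lungs are clear. No pleural effusion or pneumothorax.", "normal"),
    ("Right lower lobe consolidation consistent with pneumonia.", "pneumonia"),
]

def build_prompt(report: str) -> str:
    """Assemble a few-shot classification prompt for one report."""
    lines = [
        "You are a radiology report classifier.",
        f"Assign exactly one label from: {', '.join(LABELS)}.",
        "",
    ]
    # In-context examples teach the label format without fine-tuning.
    for example_report, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Report: {example_report}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The query report; the model is expected to complete the label.
    lines.append(f"Report: {report}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_prompt("Small left pleural effusion, otherwise unremarkable.")
```

Because the taxonomy is passed in as plain text rather than baked into a classification head, swapping in a new label set or language requires only changing `LABELS` and the in-context examples, which is the sense in which such a framework is taxonomy-agnostic.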