DALL-M: Context-Aware Clinical Data Augmentation with LLMs

📅 2024-07-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
X-ray diagnosis is often hindered by the absence of structured clinical context. To address this, we propose DALL-M, a three-stage context-aware synthetic data augmentation framework (clinical context storage, expert query generation, and context-aware feature augmentation) that draws on medical knowledge sources such as Radiopaedia and Wikipedia to expand patient vital signs, radiology findings, and demographic attributes. The method both generates contextual synthetic values for existing features and creates new, clinically relevant features. Applied to 799 cases from MIMIC-IV, it expands the original 9 clinical features to 91, yielding a 16.5% improvement in F1 score and a 25% increase in precision and recall. These gains were measured with Decision Trees, Random Forests, XGBoost, and TabNET.

📝 Abstract
X-ray images are vital in medical diagnostics, but their effectiveness is limited without clinical context. Radiologists often find chest X-rays insufficient for diagnosing underlying diseases, necessitating the integration of structured clinical features with radiology reports. To address this, we introduce DALL-M, a novel framework that enhances clinical datasets by generating contextual synthetic data. DALL-M augments structured patient data, including vital signs (e.g., heart rate, oxygen saturation), radiology findings (e.g., lesion presence), and demographic factors. It integrates this tabular data with contextual knowledge extracted from radiology reports and domain-specific resources (e.g., Radiopaedia, Wikipedia), ensuring clinical consistency and reliability. DALL-M follows a three-phase process: (i) clinical context storage, (ii) expert query generation, and (iii) context-aware feature augmentation. Using large language models (LLMs), it generates both contextual synthetic values for existing clinical features and entirely new, clinically relevant features. Applied to 799 cases from the MIMIC-IV dataset, DALL-M expanded the original 9 clinical features to 91. Empirical validation with machine learning models (including Decision Trees, Random Forests, XGBoost, and TabNET) demonstrated a 16.5% improvement in F1 score and a 25% increase in Precision and Recall. DALL-M bridges an important gap in clinical data augmentation by preserving data integrity while enhancing predictive modeling in healthcare. Our results show that integrating LLM-generated synthetic features significantly improves model performance, making DALL-M a scalable and practical approach for AI-driven medical diagnostics.
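The three phases named in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the class `ContextStore`, the functions `generate_queries` and `augment_patient`, the prompt wording, the keyword-overlap retrieval, and the stub `fake_llm` are all hypothetical names and choices made for illustration. A real system would presumably call an actual LLM and use a proper retrieval index.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class ContextStore:
    """Phase (i): stores text snippets from reports and reference resources."""
    documents: List[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.documents.append(text)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        # Toy ranking by shared words (an assumption; embeddings are more typical).
        terms = set(query.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda d: len(terms & set(d.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def generate_queries(feature: str, llm: LLM) -> List[str]:
    """Phase (ii): ask the LLM for expert-style questions about a feature."""
    answer = llm(f"List questions a radiologist would ask about: {feature}")
    return [line.strip() for line in answer.split("\n") if line.strip()]


def augment_patient(
    patient: Dict[str, object],
    new_features: List[str],
    store: ContextStore,
    llm: LLM,
) -> Dict[str, object]:
    """Phase (iii): fill in each new feature using retrieved context."""
    augmented = dict(patient)  # copy so the original record is left unchanged
    for feature in new_features:
        context: List[str] = []
        for query in generate_queries(feature, llm):
            context.extend(store.retrieve(query))
        prompt = f"Patient: {patient}\nContext: {context}\nValue for {feature}:"
        augmented[feature] = llm(prompt).strip()
    return augmented


def fake_llm(prompt: str) -> str:
    """Deterministic stub used only so the sketch runs without a model."""
    if prompt.startswith("List questions"):
        return "What does low oxygen saturation suggest?\nIs fever present?"
    return "yes"


if __name__ == "__main__":
    store = ContextStore()
    store.add("Low oxygen saturation with fever may suggest pneumonia")
    record = augment_patient({"heart_rate": 110}, ["dyspnea"], store, fake_llm)
    print(record)
```

Passing the LLM in as a plain callable keeps the sketch testable with a stub; swapping in a real model would only require a function with the same signature.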
Problem

Research questions and friction points this paper is trying to address.

Enhances clinical datasets with contextual synthetic data
Integrates structured patient data with radiology reports
Improves predictive modeling in healthcare diagnostics
Innovation

Methods, ideas, or system contributions that make the work stand out.

DALL-M integrates LLMs for clinical data augmentation.
Generates synthetic features from radiology reports and resources.
Improves ML model performance in medical diagnostics.
Chih-Jou Hsieh
Queensland University of Technology, Brisbane, Australia
Catarina Moreira
Associate Professor in Machine Learning @Data Science Institute, UTS
Explainable-AI, Human-Centered AI, Deep Learning, Probabilistic Models, Quantum Cognition
Isabel Blanco Nobre
Imagiology Department, Grupo Lusíadas, Lisboa, Portugal
Sandra Costa Sousa
Imagiology Department, Grupo Lusíadas, Lisboa, Portugal
Chun Ouyang
Associate Professor, PhD, Queensland University of Technology
Process Mining, Explainable AI, Predictive Analytics, AI Robustness, Machine Learning
M. Brereton
Queensland University of Technology, Brisbane, Australia
Joaquim A. Jorge
Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
Jacinto C. Nascimento
Institute for Systems and Robotics (ISR/IST), LARSyS, Instituto Superior Técnico
Signal Processing, Machine Learning, Computer Vision, Robotics