Synthetic Data Generation for Augmenting Small Samples

📅 2025-01-30
🤖 AI Summary
Machine learning models trained on small clinical datasets often generalize poorly and give suboptimal prognostic performance. To address this, the paper proposes a synthetic data augmentation framework that integrates generative models (including GANs, VAEs, and SMOTE variants) and selects among them using two criteria: the diversity of the generated data and the gain in AUC. It also introduces an interpretable decision-support model for judging whether augmentation is likely to help, and reports systematic associations between augmentation efficacy and intrinsic data characteristics (number of observations, baseline AUC, cardinality of categorical variables, outcome balance). On seven real-world small-sample medical datasets, augmentation yields an average relative AUC improvement of 15.55% (range 4.31% to 43.23%). Augmented-data AUC is higher than resampling-only AUC (p = 0.016), and the augmented datasets are more diverse than the resampled ones (p = 0.046).
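To make the SMOTE family mentioned in the summary concrete, here is a minimal sketch of the basic interpolation idea: a synthetic row is placed at a random point on the line segment between a minority-class row and one of its nearest minority-class neighbours. This is not the paper's implementation. The function name `smote_like`, the parameters `k` and `seed`, and the toy data are all illustrative assumptions, and the sketch handles numeric features only.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority-class
    row and one of its k nearest minority-class neighbours (numeric features)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows")
    rng = random.Random(seed)          # fixed seed for reproducibility
    k = min(k, len(minority) - 1)      # cannot have more neighbours than rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical toy minority class with two numeric features.
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_like(minority_rows, n_new=6)
```

Because every synthetic row is a convex combination of two existing rows, it always stays inside the range already covered by the minority class. The summary suggests this limited diversity is part of what the framework weighs when comparing generators.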

📝 Abstract
Small datasets are common in health research. However, the generalization performance of machine learning models is suboptimal when the training datasets are small. Data augmentation is one solution: it increases sample size and is seen as a form of regularization that increases the diversity of small datasets, leading models trained on them to perform better on unseen data. We found that augmentation improves prognostic performance for datasets that have fewer observations, smaller baseline AUC, higher-cardinality categorical variables, and more balanced outcome variables. No specific generative model consistently outperformed the others. We developed a decision support model that can be used to inform analysts whether augmentation would be useful. For seven small application datasets, augmenting the existing data results in a relative increase in AUC between 4.31% (AUC from 0.71 to 0.75) and 43.23% (AUC from 0.51 to 0.73), with an average 15.55% relative improvement, demonstrating the nontrivial impact of augmentation on small datasets (p=0.0078). The AUC after augmentation was higher than the AUC after resampling only (p=0.016). The diversity of augmented datasets was higher than the diversity of resampled datasets (p=0.046).
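The percentages in the abstract are relative gains over the baseline AUC, not absolute differences in AUC. The sketch below recomputes them from the two-decimal AUC values quoted above. The helper name `relative_gain_percent` is an assumption made for illustration. The results are close to the reported figures but not identical, presumably because the paper's percentages were computed from unrounded AUCs; that explanation is also an assumption.

```python
def relative_gain_percent(baseline_auc, augmented_auc):
    """Relative AUC improvement, expressed as a percentage of the baseline."""
    if baseline_auc <= 0:
        raise ValueError("baseline AUC must be positive")
    return 100.0 * (augmented_auc - baseline_auc) / baseline_auc

# Rounded AUC pairs quoted in the abstract.
smallest = relative_gain_percent(0.71, 0.75)  # reported as 4.31%
largest = relative_gain_percent(0.51, 0.73)   # reported as 43.23%

print(f"smallest: {smallest:.2f}%  largest: {largest:.2f}%")
# prints "smallest: 5.63%  largest: 43.14%"
```

The absolute gains for the same two datasets are 0.04 and 0.22 AUC points, which is why the dataset with the weakest baseline shows by far the largest relative improvement.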
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Small Datasets
Health Research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Augmentation
Machine Learning
Small Datasets
Dan Liu
CHEO Research Institute, Ottawa, Canada; University of Ottawa, Ottawa, Canada
Samer El Kababji
Princess Sumaya University of Technology
Machine Learning, Synthetic Data Generation
Nicholas Mitsakakis
CHEO Research Institute, Ottawa, Canada
Lisa Pilgram
CHEO Research Institute, Ottawa, Canada; University of Ottawa, Ottawa, Canada; Department of Nephrology and Medical Intensive Care, Charité - Universitaetsmedizin Berlin, Berlin, Germany
Thomas Walters
Hospital for Sick Children, Toronto, Canada
Mark Clemons
Ottawa Hospital Research Institute, Ottawa, Canada; Division of Medical Oncology, Department of Medicine, University of Ottawa, Ontario, Canada
G. Pond
McMaster University, Hamilton, Canada
Alaa El-Hussuna
OpenSourceResearch, Aalborg, Denmark
K. E. Emam
CHEO Research Institute, Ottawa, Canada; University of Ottawa, Ottawa, Canada