RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation

📅 2026-04-03
🤖 AI Summary
This study addresses facial phenotyping in children with rare diseases, a task hindered by the scarcity of curated, ethically sourced data and by high phenotypic similarity across conditions. The authors introduce RDFace, an ethically verified benchmark dataset with standardized metadata, comprising 456 pediatric facial images across 103 rare genetic conditions (about 4.4 images per condition). They benchmark several pretrained vision backbones with cross-validation and explore synthetic augmentation using DreamBooth and FastGAN, filtering generated images by facial landmark similarity to preserve phenotype fidelity. Merging the filtered synthetic images with real data improves diagnostic accuracy by up to 13.7% in ultra-low-data regimes. As a check on semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images reach a report similarity score of 0.84.
📝 Abstract
Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
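The abstract states that generated images are filtered via facial landmark similarity but does not give the metric or threshold. The following is a minimal sketch of one way such a filter could work, not the authors' implementation: it assumes each face is already reduced to a `(K, 2)` array of landmark coordinates, scores similarity with a Procrustes-style alignment (invariant to translation, scale, and rotation), and keeps a synthetic image when it is close enough to at least one real image of the same condition. The function names and the `0.95` threshold are hypothetical.

```python
import numpy as np


def normalize_landmarks(points):
    """Center a (K, 2) landmark array and scale it to unit Frobenius norm."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    scale = np.linalg.norm(centered)
    if scale == 0.0:
        raise ValueError("landmarks collapse to a single point")
    return centered / scale


def landmark_similarity(a, b):
    """Procrustes-style similarity between two landmark sets of equal shape.

    After centering and scaling, the sum of singular values of a^T b is the
    best achievable alignment under an orthogonal transform (rotation or
    reflection). It is 1 for identical shapes and lower for dissimilar ones.
    """
    a_n = normalize_landmarks(a)
    b_n = normalize_landmarks(b)
    singular_values = np.linalg.svd(a_n.T @ b_n, compute_uv=False)
    return float(singular_values.sum())


def filter_synthetic(real_sets, synthetic_sets, threshold=0.95):
    """Return indices of synthetic landmark sets that match some real set.

    threshold is an assumed value for illustration, not taken from the paper.
    """
    kept = []
    for idx, synth in enumerate(synthetic_sets):
        best = max(landmark_similarity(real, synth) for real in real_sets)
        if best >= threshold:
            kept.append(idx)
    return kept
```

In this sketch, `filter_synthetic(real_landmarks, synthetic_landmarks)` would be run once per condition, and only the images at the returned indices would be merged with the real training data. Because the alignment also permits reflections, a mirrored face scores as identical; a stricter variant would restrict the transform to proper rotations.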
Problem

Research questions and friction points this paper is trying to address.

rare disease
facial phenotype
data scarcity
pediatric diagnosis
medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

rare disease facial analysis
extreme data scarcity
phenotype-aware synthetic generation
medical image augmentation
vision-language validation
Ganlin Feng
Western University
Yuxi Long
Western University
Hafsa Ali
Concordia University
Erin Lou
University of Toronto
Fahad Butt
Western University
Qian Liu
University of Winnipeg
Yang Wang
Computer Science, Concordia University
computer vision, machine learning, deep learning, artificial intelligence
Pingzhao Hu
Canada Research Chair, Associate Prof, Western University, Associate Prof., Univ. of Toronto
Bioinformatics, Statistical Genetics, Deep Learning, Health Data Science, Medical Imaging