Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses pediatric rare disease identification, where scarce real-world data, strict privacy constraints, and limited data sharing impede the development and clinical deployment of computer vision models. The authors examine a synthetic-only regime in which models are trained exclusively on phenotype-aware, high-fidelity synthetic facial images, with no real patient data, and evaluate it across multiple backbone architectures at increasing dataset scales. At sufficient scale, models trained only on synthetic data perform comparably to real-data-only baselines, which suggests that such data can approximate clinically relevant phenotypic distributions. The work points to privacy-preserving applications in rare disease diagnosis, genetic counseling, and medical education.

📝 Abstract

Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that, at sufficient scale, synthetic-only training achieves performance comparable to real-data-only baselines across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. Together, these findings support the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, including clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children's healthcare.
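The scaling comparison described in the abstract can be pictured with a minimal sketch. It is not the authors' code: the function names, the nested-subset sampling, and the tolerance-based notion of "comparable" are all assumptions made for illustration, and `train_and_evaluate` is a hypothetical callable standing in for whatever training and scoring pipeline is actually used.

```python
import random


def nested_subsets(pool, scales, seed=0):
    """Return {scale: subset}; each smaller subset is a prefix of the larger ones.

    Nesting is an assumed design choice: it keeps the sweep controlled, so that
    growing the training set only ever adds images.
    """
    if max(scales) > len(pool):
        raise ValueError("largest scale exceeds pool size")
    shuffled = list(pool)
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(scales)}


def run_sweep(pool, scales, train_and_evaluate, seed=0):
    """Train once per scale and collect {scale: score}.

    `train_and_evaluate` is a hypothetical callable: it takes a list of
    synthetic training items and returns a score on a fixed real test set.
    """
    subsets = nested_subsets(pool, scales, seed)
    return {n: train_and_evaluate(subset) for n, subset in subsets.items()}


def smallest_matching_scale(scores_by_scale, real_baseline, tolerance):
    """Smallest scale whose score is within `tolerance` of the real-data baseline.

    Returns None if no scale in the sweep reaches that band.
    """
    for n in sorted(scores_by_scale):
        if scores_by_scale[n] >= real_baseline - tolerance:
            return n
    return None
```

Running `run_sweep` once per backbone and passing each result to `smallest_matching_scale` would give one way to state "comparable at sufficient scale" concretely. The tolerance is a free parameter that the source does not specify.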

Problem

Research questions and friction points this paper is trying to address.

data scarcity

pediatric rare disease

facial phenotype

synthetic data

privacy constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data

pediatric rare disease

facial phenotype