🤖 AI Summary
Laryngoscopic image scarcity and insufficient annotation severely limit the generalizability of computer-aided diagnosis and explanation (CADx/e) systems in otolaryngology. To address this, we propose a clinically guided synthetic data generation framework whose novelty lies in integrating latent diffusion models (LDMs) with ControlNet while explicitly incorporating anatomical structures and pathological features as clinical priors, synthesizing high-fidelity, diverse, and precisely annotated laryngoscopic images. A blinded expert evaluation found no statistically significant perceptual difference between synthetic and real images (p > 0.05). Adding only 10% synthetic data improved internal detection performance by 9% and cross-domain generalization by 22.1%, substantially alleviating the small-sample bottleneck in specialized clinical domains. This work establishes a generalizable, clinically grounded paradigm for trustworthy AI modeling in medical imaging scenarios characterized by data scarcity.
📝 Abstract
Although computer-aided diagnosis (CADx) and detection (CADe) systems have made significant progress in various medical domains, their application is still limited in specialized fields such as otorhinolaryngology. In this field, current assessment methods depend heavily on operator expertise, and the high heterogeneity of lesions complicates diagnosis, with biopsy persisting as the gold standard despite its substantial costs and risks. A critical bottleneck for specialized endoscopic CADx/e systems is the lack of well-annotated datasets with sufficient variability for real-world generalization. This study introduces a novel approach that exploits a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate laryngeal endoscopic image-annotation pairs, guided by clinical observations. The method addresses data scarcity by conditioning the diffusion process to produce realistic, high-quality, and clinically relevant image features that capture diverse anatomical conditions. The proposed approach can be leveraged to expand training datasets for CADx/e models, empowering the assessment process in laryngology. Indeed, in a downstream detection task, the addition of only 10% synthetic data improved the detection rate of laryngeal lesions by 9% in internal testing and by 22.1% on out-of-domain external data. Additionally, the realism of the generated images was evaluated by asking five expert otorhinolaryngologists with varying levels of expertise to rate their confidence in distinguishing synthetic from real images. This work has the potential to accelerate the development of automated tools for laryngeal disease diagnosis, offering a solution to data scarcity and demonstrating the applicability of synthetic data in real-world scenarios.
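The key idea of conditioning generation on clinical priors so that each synthetic image ships with a pixel-aligned annotation can be sketched as follows. This is a minimal illustration, not the authors' code: the mask encoding (background/anatomy/lesion values), function names, and the prompt text are all assumptions. The condition map that would steer a ControlNet adapter is reused directly as the annotation for the generated image, which is what makes the pairs "precisely annotated" for free.

```python
import numpy as np

def build_condition_map(anatomy_contour: np.ndarray,
                        lesion_mask: np.ndarray) -> np.ndarray:
    """Combine clinical priors into a single ControlNet-style condition map.

    anatomy_contour, lesion_mask: binary HxW arrays (hypothetical encoding).
    Returns a uint8 HxW map: 0 = background, 128 = anatomy, 255 = lesion.
    """
    cond = np.zeros(anatomy_contour.shape, dtype=np.uint8)
    cond[anatomy_contour > 0] = 128
    cond[lesion_mask > 0] = 255  # lesions drawn on top of anatomy
    return cond

def make_training_pair(cond_map: np.ndarray,
                       prompt: str = "laryngoscopic image, vocal folds"):
    # The condition map guides the diffusion process (via ControlNet) and
    # doubles as the annotation, so every synthetic image is labeled by
    # construction rather than by post-hoc manual annotation.
    return {"condition": cond_map, "annotation": cond_map, "prompt": prompt}

# Toy 4x4 example: one row of anatomy with a single lesion pixel on it.
anatomy = np.zeros((4, 4), dtype=np.uint8)
anatomy[1, :] = 1
lesion = np.zeros((4, 4), dtype=np.uint8)
lesion[1, 2] = 1
pair = make_training_pair(build_condition_map(anatomy, lesion))
```

In a full pipeline, `pair["condition"]` would be passed as the control image to a ControlNet-augmented LDM sampler, and `pair["annotation"]` stored alongside the generated frame as the detection/segmentation label.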