🤖 AI Summary
This study addresses an overlooked issue in generative data augmentation for medical imaging: large pretrained autoencoders reconstruct images with high fidelity, yet downstream classifiers struggle to learn from their latent representations. The work formally defines this “learnability gap” and validates it empirically, showing that the bottleneck lies in the structure of the latent space, not in reconstruction fidelity or a lack of domain adaptation. To probe and partially narrow the gap, the authors propose noise-conditioned latent classifiers with FiLM layers and image-space distillation, which also serve as diagnostic tools for latent space quality. Experiments across five autoencoder families and four medical imaging benchmarks show that the gap persists regardless of architecture, initialization, or hyperparameter tuning. Relative to image-space models, the latent classifiers offer 64× throughput and 120× memory gains.
📝 Abstract
Generative data augmentation with latent diffusion models is a promising strategy for addressing class imbalance in medical imaging, yet current approaches focus on perceptual fidelity and domain-specific autoencoder fine-tuning while neglecting a more fundamental bottleneck. We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. Across five autoencoder families and four medical benchmarks spanning chest radiography, dermatoscopy, computed tomography, and echocardiography, we show that this gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and that medical-domain fine-tuning of the autoencoder does not close it. To probe and partially narrow the gap, we develop noise-conditioned latent classifiers with FiLM layers and image-space distillation that offer 64x throughput and 120x memory gains over image-space models while serving as diagnostic tools for latent space quality. Our analysis provides a new framework for evaluating autoencoder latent spaces and identifies their structure, rather than their fidelity or domain specificity, as the primary obstacle to closing the performance gap between real and synthetic medical training data.
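To make the phrase "noise-conditioned latent classifiers with FiLM layers" more concrete, here is a minimal NumPy sketch of feature-wise linear modulation (FiLM) driven by a noise level. It is illustrative only and is not the authors' implementation: the sinusoidal embedding, the tensor shapes, and the names `noise_embedding` and `FiLM` are assumptions made for this example. The idea is that a conditioning vector derived from the noise level produces a per-channel scale and shift that are applied to the latent feature map.

```python
import numpy as np


def noise_embedding(sigma, dim=8):
    """Sinusoidal embedding of one scalar noise level per sample (assumed choice)."""
    sigma = np.asarray(sigma, dtype=np.float64).reshape(-1, 1)       # (B, 1)
    freqs = 2.0 ** np.arange(dim // 2)                                # (dim/2,)
    angles = sigma * freqs                                            # (B, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (B, dim)


class FiLM:
    """Feature-wise linear modulation: out = gamma(c) * x + beta(c)."""

    def __init__(self, cond_dim, num_channels, rng):
        # One linear map produces both gamma and beta (hypothetical layout).
        self.w = rng.normal(scale=0.02, size=(cond_dim, 2 * num_channels))
        # Bias starts gamma at 1 and beta at 0, so the layer is near identity.
        self.b = np.concatenate([np.ones(num_channels), np.zeros(num_channels)])

    def __call__(self, x, cond):
        # x: (B, C, H, W) latent feature map; cond: (B, cond_dim)
        gamma, beta = np.split(cond @ self.w + self.b, 2, axis=1)    # each (B, C)
        return gamma[:, :, None, None] * x + beta[:, :, None, None]


rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 16, 8, 8))   # stand-in for autoencoder latents
sigma = np.array([0.0, 0.1, 0.5, 1.0])     # assumed per-sample noise levels
film = FiLM(cond_dim=8, num_channels=16, rng=rng)
out = film(latents, noise_embedding(sigma))
```

In a full classifier, layers like this would sit between convolutional or attention blocks operating on the latents, so that the same network can adapt its features to different noise levels. The abstract does not specify those details, so they are left out here.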