🤖 AI Summary
Text-to-image diffusion models suffer from mode collapse on long-tailed medical data (e.g., rare pathologies), severely degrading generation quality and diversity for tail classes. To address this, we propose GRASP: (1) we identify gradient conflict between head and tail classes as the primary cause; (2) we introduce a static sample-clustering scheme coupled with cluster-specific residual adapters, injected into the pre-trained diffusion model's Transformer feed-forward layers without learnable gating, which ensures both training stability and inference efficiency; (3) leveraging external anatomical and pathological priors, we enable fine-grained conditional optimization. On MIMIC-CXR-LT and NIH-CXR-LT, GRASP achieves significantly lower FID and higher diversity scores than state-of-the-art baselines, while improving downstream classification accuracy by up to 8.2%. Generalizability is further validated on ImageNet-LT.
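The cluster-specific residual adapters described above can be sketched as follows. This is a minimal illustrative stand-in, not the paper's implementation: the widths `D`, bottleneck size `R`, the identity placeholder for the frozen feed-forward sublayer, and zero-initialized up-projections are all assumptions for the sketch. The key property shown is the hard, gate-free routing: each sample's static cluster id selects its adapter directly.

```python
import numpy as np

rng = np.random.default_rng(0)

D, R = 16, 4      # hypothetical model width and adapter bottleneck size
N_CLUSTERS = 3    # number of static sample clusters

# One small residual adapter per cluster: down-projection, ReLU, up-projection.
# Up-projections start at zero so training begins from the frozen model's output.
adapters = [
    {"down": rng.normal(0.0, 0.02, (D, R)), "up": np.zeros((R, D))}
    for _ in range(N_CLUSTERS)
]

def ffn(x):
    # Stand-in for the frozen pre-trained Transformer feed-forward sublayer.
    return x  # identity placeholder for illustration only

def ffn_with_adapter(x, cluster_id):
    """Frozen FFN output plus the residual adapter of the sample's cluster.

    Routing is a hard lookup by the precomputed cluster id, not a learned
    gate, so adapter selection adds no trainable routing parameters.
    """
    a = adapters[cluster_id]
    h = np.maximum(x @ a["down"], 0.0)   # bottleneck + ReLU
    return ffn(x) + h @ a["up"]          # residual injection

x = rng.normal(size=(2, D))
# With zero-initialized up-projections, the adapted layer matches the frozen FFN.
print(np.allclose(ffn_with_adapter(x, cluster_id=1), ffn(x)))  # True
```

Zero-initializing the up-projection is a common adapter convention (the adapted network starts exactly at the pre-trained solution); only each cluster's small `down`/`up` matrices would be trained.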
📝 Abstract
Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of the data. This results in mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflict between frequent head and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods, which mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient clashes. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into Transformer feed-forward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines such as vanilla fine-tuning and Mixture-of-Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels. Generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with existing diffusion pipelines.
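The notion of gradient conflict and conflict-minimizing partitioning can be illustrated with a toy sketch. Everything here is hypothetical: the abstract does not specify GRASP's partitioning procedure (which is guided by external priors), so the greedy cosine-similarity scheme below, the function names, and the synthetic per-class gradients are illustrative assumptions only. Two classes "conflict" when their gradient directions point against each other (negative cosine similarity); grouping non-conflicting classes together is the property the partition targets.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two gradient vectors; < 0 indicates conflict.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def greedy_partition(class_grads, n_clusters):
    """Assign each class to the cluster whose running mean gradient it agrees with most.

    A toy stand-in for conflict-minimizing partitioning; GRASP's actual
    prior-guided procedure is not described in the abstract.
    """
    assignments, centroids = {}, []
    for cls, g in class_grads.items():
        if len(centroids) < n_clusters:
            centroids.append(g.copy())           # seed a new cluster
            assignments[cls] = len(centroids) - 1
        else:
            best = max(range(n_clusters), key=lambda k: cosine(g, centroids[k]))
            assignments[cls] = best
            centroids[best] = 0.9 * centroids[best] + 0.1 * g  # update mean
    return assignments

# Synthetic per-class gradients: opposing directions simulate head/tail conflict.
rng = np.random.default_rng(1)
base = rng.normal(size=8)
grads = {
    "head_a":  base + 0.05 * rng.normal(size=8),
    "tail_a": -base + 0.05 * rng.normal(size=8),
    "head_b":  base + 0.05 * rng.normal(size=8),
    "tail_b": -base + 0.05 * rng.normal(size=8),
}
clusters = greedy_partition(grads, n_clusters=2)
print(clusters)  # head classes land in one cluster, tail classes in the other
```

Within each resulting cluster, gradient updates reinforce rather than cancel each other, which is the intuition behind training a separate residual adapter per cluster.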