Generative Data Mining with Longtail-Guided Diffusion

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of predictive models on rare or challenging inputs. It proposes Longtail Guidance (LTG), a proactive data generation method applied during training rather than after deployment. LTG computes differentiable, model-based longtail signals, such as epistemic uncertainty estimated in a single forward pass, and uses them to guide a latent diffusion model toward samples that the predictive model finds rare or hard. Generating this data requires retraining neither the diffusion model nor the predictive model, and the predictive model is never exposed to intermediate diffusion states. The generated data show semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to discover, explain, and address conceptual gaps in the predictive model.

📝 Abstract
It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to proactively discover, explain, and address conceptual gaps in a predictive model.
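The abstract does not spell out its uncertainty formulation, so the following is only a minimal sketch of one standard way to get an epistemic signal from a single forward pass: several lightweight prediction heads share one backbone pass, and their disagreement is measured as mutual information (entropy of the mean prediction minus the mean per-head entropy). The multi-head setup, the function names, and the `head_logits` layout are assumptions for illustration, not the authors' method.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def entropy(probs, axis=-1, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -(probs * np.log(probs + eps)).sum(axis=axis)

def epistemic_uncertainty(head_logits):
    """Mutual-information disagreement across heads (assumed setup).

    head_logits: array of shape (K, C), the logits of K hypothetical
    heads over C classes for one input, all from one forward pass.
    """
    probs = softmax(head_logits, axis=-1)   # (K, C)
    total = entropy(probs.mean(axis=0))     # entropy of the mean prediction
    expected = entropy(probs).mean()        # mean per-head entropy
    return total - expected                 # high when the heads disagree

# Heads that agree give a signal near zero; heads that disagree do not.
agree = np.array([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
disagree = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
low, high = epistemic_uncertainty(agree), epistemic_uncertainty(disagree)
```

Every operation here is smooth in the logits, so the same computation written in an autodiff framework would yield gradients with respect to the input, which is the property a guidance signal needs.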
Problem

Research questions and friction points this paper is trying to address.

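Proactively discovering rare or hard inputs during training, before deployment
Generating additional training data without retraining the diffusion model or the predictive model
Improving generalization and explaining conceptual gaps in the predictive model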
Innovation

Methods, ideas, or system contributions that make the work stand out.

Longtail-Guided Diffusion
Differentiable epistemic uncertainty
Latent diffusion model
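To make the guidance idea concrete, here is a toy sketch of gradient-based guidance on a latent: after each ordinary sampler update, the latent is nudged along the gradient of a scalar signal. It is a simplified illustration, not the paper's procedure. The `denoise_step` and `signal` callables, the 2-D latent, the `scale` value, and the finite-difference gradient (standing in for autodiff) are all assumptions made so the example runs with NumPy alone.

```python
import numpy as np

def numerical_grad(f, z, h=1e-5):
    """Central finite-difference gradient; a stand-in for autodiff."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step.flat[i] = h
        grad.flat[i] = (f(z + step) - f(z - step)) / (2 * h)
    return grad

def guided_step(z, denoise_step, signal, scale):
    """Take the sampler's usual step, then nudge toward higher signal."""
    z_next = denoise_step(z)
    return z_next + scale * numerical_grad(signal, z_next)

# Toy stand-ins (hypothetical): a sampler that shrinks the latent, and a
# signal that peaks at a "hard" region of latent space.
hard_region = np.array([1.0, 1.0])

def denoise_step(z):
    return 0.9 * z

def signal(z):
    return -np.sum((z - hard_region) ** 2)

z_guided = np.array([2.0, -1.0])
z_plain = z_guided.copy()
for _ in range(20):
    z_guided = guided_step(z_guided, denoise_step, signal, scale=0.1)
    z_plain = denoise_step(z_plain)
```

In this toy the guided trajectory settles between the sampler's own attractor and the high-signal region, so `scale` trades sample fidelity against how strongly the signal steers generation.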