DMin: Scalable Training Data Influence Estimation for Diffusion Models

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Tracing data influence in large-scale diffusion models (DMs) remains challenging: existing methods carry prohibitive computational and memory overhead and do not scale to billion-parameter models. Method: DMin is a scalable influence estimation framework that uses efficient gradient compression to drastically cut both computational and storage cost. Contribution/Results: it is the first method to support influence estimation for DMs with billions of parameters, reduces per-sample gradient storage from hundreds of TBs to mere MBs or even KBs, and retrieves the top-k most influential training samples in under 1 second while maintaining identification accuracy. This provides a practical foundation for data provenance, debugging, and safety auditing of large generative models.

📝 Abstract
Identifying the training data samples that most influence a generated image is a critical task in understanding diffusion models (DMs), yet existing influence estimation methods are constrained to small-scale or LoRA-tuned models due to computational limitations. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. To the best of our knowledge, it is the first method capable of influence estimation for DMs with billions of parameters. Leveraging efficient gradient compression, DMin reduces storage requirements from hundreds of TBs to mere MBs or even KBs, and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate DMin is both effective in identifying influential training samples and efficient in terms of computational and storage requirements.
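The abstract's core recipe, compressing per-sample training gradients once and then scoring a generated image's gradient against the compressed store, can be sketched as follows. This is a minimal illustration using random-projection compression and inner-product influence scores; the dimensions, the projection scheme, and the function names are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: full gradient dimension d, compressed dimension k.
d, k = 10_000, 32
n_train = 200

# Random projection matrix (one standard gradient-compression choice;
# DMin's actual compression scheme may differ).
P = rng.standard_normal((k, d)) / np.sqrt(k)

# Stand-in per-sample training gradients, compressed once and stored.
# Storing k floats instead of d per sample is where the storage
# reduction comes from.
train_grads = rng.standard_normal((n_train, d))
compressed_store = train_grads @ P.T          # shape (n_train, k)

def top_k_influential(query_grad, store, k_top=5):
    """Score every training sample by the inner product between its
    compressed gradient and the compressed query gradient, and return
    the indices of the k_top highest-scoring samples."""
    q = P @ query_grad                        # compress the query gradient
    scores = store @ q                        # approximate influence scores
    return np.argsort(scores)[::-1][:k_top]   # top-k by descending score

# Gradient of the loss for one generated image (stand-in data).
query = rng.standard_normal(d)
idx = top_k_influential(query, compressed_store)
```

Random projections approximately preserve inner products (the Johnson-Lindenstrauss property), which is why scoring in the compressed space can still rank influential samples well; the retrieval itself is a single matrix-vector product over the stored codes, which is what makes sub-second top-k lookup plausible at scale.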
Problem

Research questions and friction points this paper is trying to address.

Scaling influence estimation beyond small or LoRA-tuned diffusion models
Reducing the storage and computational cost of per-sample gradients
Retrieving the top-k most influential training samples quickly
Innovation

Methods, ideas, or system contributions that make the work stand out.

DMin, the first influence estimation framework for billion-parameter diffusion models
Efficient gradient compression cuts storage from hundreds of TBs to MBs or KBs
Top-k influential sample retrieval in under 1 second