Leveraging Text-to-Image Generation for Handling Spurious Correlation

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In image classification, deep models trained via Empirical Risk Minimization (ERM) often exploit spurious correlations between labels and irrelevant visual features, leading to poor out-of-distribution generalization. To address this, we propose the first causal disentanglement framework integrating textual inversion, language-guided segmentation, and diffusion-based generation: it enables semantic-controllable image recomposition to attenuate spurious associations. Furthermore, we introduce a dual-criterion sample pruning mechanism—leveraging both prediction probabilities and attribution scores—to enhance the quality of synthetically generated data. Evaluated across multiple benchmarks, our method significantly improves worst-group accuracy (WGA), surpassing state-of-the-art approaches. This work establishes a scalable, generative solution for causal robust learning, advancing the principled integration of vision-language priors and diffusion modeling in distributionally robust classification.

Technology Category

Application Category

📝 Abstract
Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a technique to generate training samples with text-to-image (T2I) diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model's reliance on spurious correlations by learning from carefully crafted samples for in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves better worst-group accuracy than the existing state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Address spurious correlations in image classification models
Generate synthetic training samples using text-to-image diffusion
Improve model generalization to out-of-distribution data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-image diffusion models generate training samples
Textual inversion identifies causal visual features
Pruning ensures correct sample composition
🔎 Similar Papers
No similar papers found.
A
Aryan Yazdan Parast
University of Melbourne, Melbourne, Australia
Basim Azam
Basim Azam
Postdoctoral Research Fellow at The University of Melbourne
Deep LearningComputer VisionPattern Recognition
N
Naveed Akhtar
University of Melbourne, Melbourne, Australia