🤖 AI Summary
To address the degradation of subgroup robustness caused by compositional shifts, where the training data does not cover all attribute combinations, this paper proposes CoInD, a synthetic data generation framework. CoInD reflects the compositional structure of the world during synthesis by enforcing conditional independence between attributes through a Fisher divergence regularizer that aligns the joint and marginal conditional distributions learned by a conditional diffusion model. This allows CoInD to generate faithful synthetic data covering all attribute combinations, which in turn improves worst-group generalization under out-of-distribution compositional shifts. Evaluated on the CelebA compositional shift benchmark, CoInD achieves state-of-the-art worst-group accuracy while producing synthetic samples of superior fidelity, mitigating failures in generalizing to unseen attribute combinations without requiring additional real-world annotations.
📝 Abstract
Machine learning systems struggle with robustness under subpopulation shifts. This problem becomes especially pronounced when only a subset of attribute combinations is observed during training, a severe form of subpopulation shift referred to as compositional shift. To address this problem, we ask the following question: Can we improve robustness by training on synthetic data that spans all possible attribute combinations? We first show that training conditional diffusion models on such limited data leads to learning an incorrect underlying distribution. As a result, synthetic data sampled from these models is unfaithful and does not improve the performance of downstream machine learning systems. To address this problem, we propose CoInD, which reflects the compositional nature of the world by enforcing conditional independence through minimizing Fisher's divergence between joint and marginal distributions. We demonstrate that synthetic data generated by CoInD is faithful, and that this translates to state-of-the-art worst-group accuracy on compositional shift tasks on CelebA.
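The conditional-independence constraint described above can be made concrete at the score level: if two attributes c1 and c2 are conditionally independent given x, the joint conditional score decomposes as s(x | c1, c2) = s(x | c1) + s(x | c2) - s(x), and a Fisher-divergence-style penalty measures the squared deviation from this identity. The sketch below illustrates that penalty on toy score vectors; the function name `ci_regularizer` and the exact form of the penalty are illustrative assumptions, not the paper's verbatim loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def ci_regularizer(s_joint, s_c1, s_c2, s_uncond):
    """Fisher-divergence-style conditional-independence penalty (sketch).

    Conditional independence of attributes c1 and c2 given x implies the
    score identity
        s(x | c1, c2) = s(x | c1) + s(x | c2) - s(x),
    so we penalize the mean squared deviation from it. This is an
    illustrative form, not the paper's exact training objective.
    """
    residual = s_joint - (s_c1 + s_c2 - s_uncond)
    return float(np.mean(np.sum(residual ** 2, axis=-1)))

# Toy scores for a batch of 4 samples in 8 dimensions.
s_c1 = rng.normal(size=(4, 8))
s_c2 = rng.normal(size=(4, 8))
s_uncond = rng.normal(size=(4, 8))

# A joint score satisfying the identity exactly incurs zero penalty.
s_joint_consistent = s_c1 + s_c2 - s_uncond
print(ci_regularizer(s_joint_consistent, s_c1, s_c2, s_uncond))

# A perturbed joint score incurs a positive penalty.
s_joint_bad = s_joint_consistent + 0.1 * rng.normal(size=(4, 8))
print(ci_regularizer(s_joint_bad, s_c1, s_c2, s_uncond))
```

In practice the three conditional scores would come from a single conditional diffusion model queried with both, one, or neither attribute (e.g. via attribute dropout), and the penalty would be added to the usual denoising loss.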