FairCauseSyn: Towards Causally Fair LLM-Augmented Synthetic Data Generation

📅 2025-06-23
🤖 AI Summary
Existing synthetic data generation methods focus on counterfactual fairness and are applied mainly in finance and legal domains, neglecting causal fairness modeling and failing to preserve the underlying causal structure of real medical data. Method: We propose the first tabular synthetic data generation framework integrating large language models (LLMs) with causal inference, explicitly modeling causal pathways involving sensitive attributes. Our approach couples an LLM-enhanced generator with causally constrained adversarial training to keep the causal structure invariant. Contribution/Results: Evaluated on real-world healthcare datasets, the generated data deviates by less than 10% from real data on causal fairness metrics, and predictive models trained on the synthetic data exhibit a 70% reduction in bias with respect to the sensitive attribute. This work is the first to unify LLM-driven generation and causal fairness in medical tabular data synthesis, substantially enhancing fairness and trustworthiness in health research and clinical decision-making.
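The summary's "LLM-enhanced generator coupled with causally constrained adversarial training" suggests an objective that trades off adversarial realism against a penalty on causal-structure drift. A minimal sketch of that idea follows; the function names, the form of the penalty, and the weight `lam` are all hypothetical illustrations, not the paper's actual architecture:

```python
import numpy as np

def adversarial_loss(d_fake):
    # generator's adversarial term: push the discriminator to
    # score synthetic rows as real (scores in (0, 1))
    return -np.mean(np.log(d_fake + 1e-8))

def causal_penalty(real_effect, synth_effect):
    # penalize deviation of an estimated causal effect (e.g. of the
    # sensitive attribute on the outcome) between real and synthetic data
    return abs(real_effect - synth_effect)

def combined_loss(d_fake, real_effect, synth_effect, lam=1.0):
    # causally constrained objective: fool the discriminator while
    # keeping the estimated causal effect close to the real data's
    return adversarial_loss(d_fake) + lam * causal_penalty(real_effect, synth_effect)

# toy numbers: discriminator scores for two synthetic rows, plus
# causal-effect estimates on real vs. synthetic data
loss = combined_loss(np.array([0.6, 0.7]), real_effect=0.30,
                     synth_effect=0.33, lam=10.0)
```

Raising `lam` tightens causal-structure invariance at the cost of sample realism, which is the trade-off any such constrained training has to navigate.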

📝 Abstract
Synthetic data generation creates data based on real-world data using generative models. In health applications, generating high-quality data while maintaining fairness for sensitive attributes is essential for equitable outcomes. Existing GAN-based and LLM-based methods focus on counterfactual fairness and are primarily applied in finance and legal domains. Causal fairness provides a more comprehensive evaluation framework by preserving causal structure, but current synthetic data generation methods do not address it in health settings. To fill this gap, we develop the first LLM-augmented synthetic data generation method to enhance causal fairness using real-world tabular health data. Our generated data deviates by less than 10% from real data on causal fairness metrics. Causally fair predictors trained on the synthetic data exhibit 70% less bias on the sensitive attribute than predictors trained on real data. This work improves access to fair synthetic data, supporting equitable health research and healthcare delivery.
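The abstract's headline claim — "deviates by less than 10% from real data on causal fairness metrics" — can be made operational along these lines. The paper's exact metrics are not given here, so the parity-gap metric and the relative-deviation formula below are stand-in assumptions for illustration only:

```python
import numpy as np

def parity_gap(outcome, sensitive):
    # gap in positive-outcome rates across sensitive groups; a simple
    # stand-in for the paper's (unspecified) causal fairness metrics
    rates = [outcome[sensitive == g].mean() for g in np.unique(sensitive)]
    return max(rates) - min(rates)

def relative_deviation(real_metric, synth_metric):
    # fraction by which the synthetic data's metric value drifts
    # from the real data's value
    return abs(real_metric - synth_metric) / max(abs(real_metric), 1e-12)

# toy example: binary outcome and binary sensitive attribute
outcome = np.array([1, 0, 1, 1, 0, 0])
sensitive = np.array([0, 0, 0, 1, 1, 1])
gap = parity_gap(outcome, sensitive)

# hypothetical metric values measured on real vs. synthetic data;
# a deviation under 0.10 would be consistent with the stated claim
dev = relative_deviation(0.33, 0.30)
```

The same deviation check applies to any scalar fairness metric, so swapping in a causal-effect estimate for `parity_gap` would reproduce the evaluation pattern the abstract describes.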
Problem

Research questions and friction points this paper is trying to address.

Ensuring causal fairness in synthetic health data generation
Addressing bias in sensitive attributes for equitable outcomes
Developing LLM-augmented methods for causally fair synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-augmented synthetic data generation
Enhances causal fairness in health data
Reduces predictor bias on the sensitive attribute by 70%