Towards Universal Debiasing for Language Models-based Tabular Data Generation

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) tend to amplify historical biases when generating tabular data, particularly exacerbating inter-group unfairness in settings involving multiple sensitive attributes (e.g., gender, race, geography). To address this, we propose a universal debiasing framework that enhances fairness by minimizing group-level mutual information between advantaged and protected attributes. Our approach introduces two complementary methods: (i) UDF-MIX, a tuning-free technique that integrates autoregressive modeling with analytic estimation of the sampling distribution, and (ii) UDF-DPO, a direct preference optimization (DPO)-based strategy compatible with existing LLMs; together they enable efficient joint debiasing across multiple sensitive attributes. Experiments demonstrate that the framework significantly improves multiple fairness metrics, including statistical parity and equal opportunity, while preserving data utility and generation quality. It further exhibits strong scalability and is applicable to fair tabular data synthesis in high-stakes domains such as finance and healthcare.

📝 Abstract
Large language models (LLMs) have achieved promising results in tabular data generation. However, inherent historical biases in tabular datasets often cause LLMs to exacerbate fairness issues, particularly when multiple advantaged and protected features are involved. In this work, we introduce a universal debiasing framework that minimizes group-level dependencies by simultaneously reducing the mutual information between advantaged and protected attributes. By leveraging the autoregressive structure and analytic sampling distributions of LLM-based tabular data generators, our approach efficiently computes mutual information, reducing the need for cumbersome numerical estimations. Building on this foundation, we propose two complementary methods: a direct preference optimization (DPO)-based strategy, namely UDF-DPO, that integrates seamlessly with existing models, and a targeted debiasing technique, namely UDF-MIX, that achieves debiasing without tuning the parameters of LLMs. Extensive experiments demonstrate that our framework effectively balances fairness and utility, offering a scalable and practical solution for debiasing in high-stakes applications.
Problem

Research questions and friction points this paper is trying to address.

Addressing historical biases in tabular datasets amplified by LLMs
Reducing unfair dependencies between advantaged and protected attributes
Developing scalable debiasing methods for LLM-based data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal debiasing framework reduces mutual information between attributes
Leverages autoregressive structure for efficient mutual information computation
Offers both a DPO-based method (UDF-DPO) and a tuning-free method (UDF-MIX) for flexibility
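The group-level objective above can be illustrated with a minimal sketch: a plug-in estimate of the mutual information I(X; Y) between an advantaged attribute and a protected attribute in a generated table. This is not the paper's method (which computes MI analytically from the generator's autoregressive sampling distributions rather than from samples); the function, column names, and toy data are all illustrative.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired samples
    of two discrete attributes (e.g., columns of a generated table)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # empirical joint counts
    px = Counter(xs)               # marginal counts of X
    py = Counter(ys)               # marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with counts folded in
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Toy generated table: an advantaged attribute (income) and a
# protected attribute (gender); a debiased generator would drive
# this quantity toward zero.
income = ["high", "high", "low", "low", "high", "low"]
gender = ["m", "m", "f", "f", "f", "m"]
print(mutual_information(income, gender))
```

A value near zero indicates the two attributes are nearly independent in the generated data, which is what minimizing the framework's group-level dependency objective aims for.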