Flow Matching for Tabular Data Synthesis

📅 2025-11-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies flow matching for tabular data synthesis, weighing efficiency, privacy risk, and utility. It compares flow matching and variational flow matching, including TabbyFlow, under both the Optimal Transport (OT) and Variance Preserving (VP) probability paths, with deterministic and stochastic samplers; the stochastic samplers become available when learning with variational flow matching. Experiments show that flow matching outperforms the diffusion baselines TabDDPM and TabSyn while needing at most 100 function evaluations, a substantial computational advantage. The OT path gives the strongest performance, while the VP path shows potential for lower disclosure risk. Stochastic flows preserve marginal distributions and, in some instances, produce high-utility synthetic data with reduced disclosure risk.

📝 Abstract

Synthetic data generation is an important tool for privacy-preserving data sharing. While diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement flow matching for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with state-of-the-art diffusion methods (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers (the latter made possible by learning to generate with \textit{variational} flow matching), characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that flow matching, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably few function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial: the OT path demonstrates superior performance, while the VP path has potential for producing synthetic data with lower disclosure risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high-utility synthetic data with reduced disclosure risk.
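To make the two probability paths concrete, here is a minimal NumPy sketch of how a noisy point and its regression target might be built under each path. It is an illustration under stated assumptions, not the paper's implementation: the function names `ot_path` and `vp_path` are hypothetical, the trigonometric schedule in `vp_path` is one assumed variance-preserving choice (the paper's schedule may differ), and the random arrays stand in for encoded table rows.

```python
import numpy as np

def ot_path(x0, x1, t):
    """Linear (OT-style) interpolation from noise x0 to data x1.

    Returns the point x_t and the velocity target a model would regress on.
    """
    xt = (1.0 - t) * x0 + t * x1
    ut = x1 - x0  # time derivative of xt
    return xt, ut

def vp_path(x0, x1, t):
    """Variance-preserving interpolation with an ASSUMED sin/cos schedule.

    The weights satisfy a**2 + b**2 == 1 for every t in [0, 1].
    """
    a = np.sin(0.5 * np.pi * t)  # weight on data, goes 0 -> 1
    b = np.cos(0.5 * np.pi * t)  # weight on noise, goes 1 -> 0
    xt = a * x1 + b * x0
    ut = 0.5 * np.pi * (b * x1 - a * x0)  # time derivative of xt
    return xt, ut

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 3))  # placeholder for a batch of encoded rows
x0 = rng.normal(size=(4, 3))  # Gaussian noise samples
t = rng.uniform(size=(4, 1))  # one time value per row

xt_ot, ut_ot = ot_path(x0, x1, t)
xt_vp, ut_vp = vp_path(x0, x1, t)
```

In a flow matching training loop, a network would take `xt` and `t` as input and be trained to predict `ut` with a squared-error loss; only the path function changes between the OT and VP variants.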

Problem

Research questions and friction points this paper is trying to address.

Compares flow matching methods with diffusion models for tabular data synthesis.

Evaluates probability paths and samplers for balancing data utility and privacy risk.

Demonstrates flow matching's computational advantage and superior performance over baselines.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow matching for tabular data synthesis

Compares OT and VP probability paths

Uses deterministic and stochastic samplers
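As a rough illustration of the deterministic sampler and of what a 100-step budget means, the sketch below integrates a flow ODE with fixed-step Euler and counts function evaluations. The names `euler_sample` and `toy_velocity` are hypothetical, and the velocity field is a closed-form toy (the linear path toward a single target point `mu`) standing in for a trained network; it is not the paper's model.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps=100):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final samples and the number of function evaluations (NFE).
    """
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    nfe = 0
    for k in range(n_steps):
        t = k * h
        x = x + h * velocity(x, t)  # one velocity evaluation per step
        nfe += 1
    return x, nfe

mu = np.array([2.0, -1.0, 0.5])  # toy target point (assumption)

def toy_velocity(x, t):
    # Closed-form velocity of the linear path when every target equals mu.
    # A trained network would replace this function in practice.
    return (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 3))  # start from Gaussian noise
samples, nfe = euler_sample(toy_velocity, x0, n_steps=100)
```

A stochastic sampler would additionally inject noise at each step, which requires extra quantities beyond the velocity field; that variant is not sketched here.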

🔎 Similar Papers

CTSyn: A Foundational Model for Cross Tabular Data Generation