StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes

📅 2025-08-04
🤖 AI Summary
Scarce tabular data in specialized domains severely hinders machine learning deployment. Existing synthetic data generators lose fidelity when training data is scarce, and large language models (LLMs) tend to neglect the explicit dependency structure among variables. To address this, we propose what we believe is the first two-stage synthesis framework that integrates causal structure discovery with LLM-based generation. In Stage I, we learn a directed acyclic graph (DAG) from the sparse real data to capture variable dependencies. In Stage II, we use the learned DAG as a structural blueprint that guides an LLM to generate high-fidelity, structure-consistent synthetic records field by field in topological order, and we embed differential privacy to provide privacy guarantees. Extensive experiments show that our method significantly outperforms state-of-the-art approaches in both structural fidelity and downstream task performance. The gains are largest in ultra-low-data regimes, where the method balances privacy preservation with practical data utility.

📝 Abstract
The application of machine learning on tabular data in specialized domains is severely limited by data scarcity. While generative models offer a solution, traditional methods falter in low-data regimes, and recent Large Language Models (LLMs) often ignore the explicit dependency structure of tabular data, leading to low-fidelity synthetics. To address these limitations, we introduce StructSynth, a novel framework that integrates the generative power of LLMs with robust structural control. StructSynth employs a two-stage architecture. First, it performs explicit structure discovery to learn a Directed Acyclic Graph (DAG) from the available data. Second, this learned structure serves as a high-fidelity blueprint to steer the LLM's generation process, forcing it to adhere to the learned feature dependencies and thereby ensuring the generated data respects the underlying structure by design. Our extensive experiments demonstrate that StructSynth produces synthetic data with significantly higher structural integrity and downstream utility than state-of-the-art methods. It proves especially effective in challenging low-data scenarios, successfully navigating the trade-off between privacy preservation and statistical fidelity.
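
The second stage described above, in which a learned DAG steers generation so that each feature is produced only after the features it depends on, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `PARENTS` graph, the field names, and `stub_generator` (a stand-in for a real LLM call) are all hypothetical, and the sketch only shows the ordering and conditioning logic.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Hypothetical learned DAG: each field maps to the fields it depends on.
PARENTS: Dict[str, List[str]] = {
    "age": [],
    "smoker": ["age"],
    "diagnosis": ["age", "smoker"],
}


def generation_order(parents: Dict[str, List[str]]) -> List[str]:
    """Return the fields ordered so that every parent precedes its children."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(parents).static_order())


def synthesize_record(
    parents: Dict[str, List[str]],
    generate_field: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Fill one record field by field, conditioning each field on its parents."""
    record: Dict[str, str] = {}
    for field in generation_order(parents):
        # Only the values of the field's parents are passed as context.
        context = {p: record[p] for p in parents.get(field, [])}
        record[field] = generate_field(field, context)
    return record


def stub_generator(field: str, context: Dict[str, str]) -> str:
    """Placeholder for an LLM call; it just echoes which parents it saw."""
    return f"<{field}|{','.join(sorted(context))}>"


record = synthesize_record(PARENTS, stub_generator)
print(record)
```

In a real system, `stub_generator` would be replaced by a prompt to an LLM that includes the parent values; how StructSynth builds that prompt is not specified in the abstract.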

Problem

Research questions and friction points this paper is trying to address.

Overcoming data scarcity in specialized tabular data domains
Improving fidelity of synthetic data in low-data regimes
Ensuring structural integrity in LLM-generated tabular data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage architecture for tabular data synthesis
DAG-based structure discovery and control
LLM generation steered by learned dependencies
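
Because the learned structure must be a DAG before it can serve as a generation blueprint, one simple safeguard is to check a candidate edge list for cycles. The sketch below is only an assumption about how such a check could look, using Python's standard `graphlib`; the helper names `parents_from_edges` and `is_dag` are hypothetical, and the paper's actual structure-discovery procedure is not described here.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def parents_from_edges(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Convert (parent, child) edges into a child -> parents mapping."""
    parents: Dict[str, Set[str]] = {}
    for src, dst in edges:
        parents.setdefault(src, set())          # make sure root nodes appear too
        parents.setdefault(dst, set()).add(src)
    return parents


def is_dag(edges: Iterable[Edge]) -> bool:
    """Return True if the edge list contains no directed cycle."""
    try:
        # prepare() raises CycleError when the graph has a cycle.
        TopologicalSorter(parents_from_edges(edges)).prepare()
    except CycleError:
        return False
    return True


print(is_dag([("age", "smoker"), ("smoker", "diagnosis")]))
print(is_dag([("age", "smoker"), ("smoker", "age")]))
```

A candidate graph that fails this check has no valid topological order, so field-by-field generation in dependency order would be undefined for it.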