Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for synthetic tabular data generation struggle with heterogeneity, logical consistency, coverage of rare events, and performance in low-data regimes. This work proposes a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structure from stochastic texture: the top-down path builds structure-driven logical constraints and cross-modal alignment rules, while the bottom-up path uses lightweight tabular generators to learn local statistical patterns from real data. A unified synthesis engine with an iterative feedback loop combines the two paths. On weakly multimodal financial benchmarks that pair tabular data with sentiment text, the approach improves train-on-synthetic, test-on-real (TSTR) performance over neural baselines while preserving semantic consistency, balancing controllability, semantic coherence, and statistical fidelity.
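The interplay of the two paths can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy records, the three rules, the Gaussian column model standing in for the "lightweight generator", and simple rejection-and-resampling standing in for the iterative feedback mechanism.

```python
import random
import statistics

# Hypothetical toy records: a numeric column plus a sentiment label.
REAL_ROWS = [
    {"amount": 120.0, "sentiment": "positive"},
    {"amount": 80.0, "sentiment": "negative"},
    {"amount": 95.0, "sentiment": "neutral"},
    {"amount": 150.0, "sentiment": "positive"},
    {"amount": 60.0, "sentiment": "negative"},
]

# Top-down path: hand-written logical constraints (assumed, not from the paper).
RULES = [
    lambda row: row["amount"] > 0,
    lambda row: row["sentiment"] in {"positive", "neutral", "negative"},
    # Toy cross-modal alignment rule: positive sentiment implies amount >= 80.
    lambda row: row["sentiment"] != "positive" or row["amount"] >= 80.0,
]


def fit_bottom_up(rows):
    """Bottom-up path: learn simple per-column statistics from real rows."""
    amounts = [r["amount"] for r in rows]
    return {
        "mean": statistics.mean(amounts),
        "stdev": statistics.stdev(amounts),
        "sentiments": [r["sentiment"] for r in rows],
    }


def sample_row(model, rng):
    """Draw one candidate row from the fitted column statistics."""
    return {
        "amount": rng.gauss(model["mean"], model["stdev"]),
        "sentiment": rng.choice(model["sentiments"]),
    }


def synthesize(model, n_rows, rng, max_rounds=50):
    """Synthesis engine: sample, check every rule, resample what was rejected."""
    accepted = []
    for _ in range(max_rounds):
        needed = n_rows - len(accepted)
        if needed == 0:
            break
        candidates = [sample_row(model, rng) for _ in range(needed)]
        accepted.extend(c for c in candidates if all(rule(c) for rule in RULES))
    return accepted


rng = random.Random(0)  # fixed seed so the sketch is reproducible
model = fit_bottom_up(REAL_ROWS)
synthetic_rows = synthesize(model, n_rows=20, rng=rng)
```

In this sketch the rules never touch the statistical model and the model never encodes the rules, which is one simple way to read "decoupling semantic structure from stochastic texture"; a real system would likely feed rule violations back into the generator rather than only resampling.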
📝 Abstract
Existing approaches for synthetic tabular data generation are based on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structure from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed, while in the bottom-up path, lightweight tabular generators are used to learn local statistical patterns from real data. The two paths are consolidated in a unified synthesis engine with an iterative feedback loop. We evaluate the framework on weakly multimodal financial benchmarks combining tabular and sentiment-text data. Experimental results show that our H-TDBU approach improves train-on-synthetic, test-on-real (TSTR) performance over neural baseline methods while preserving semantic consistency. Our results suggest that hierarchical rule-guided synthesis provides an effective mechanism for combining controllability, semantic coherence, and statistical fidelity in synthetic data generation.
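The evaluation mentioned in the abstract presumably follows the common train-on-synthetic, test-on-real (TSTR) protocol: a downstream model is fitted on synthetic rows only and then scored on held-out real rows. The sketch below shows that protocol under stated assumptions; the nearest-centroid classifier, the one-dimensional feature, and both toy datasets are hypothetical and do not come from the paper.

```python
def fit_centroids(rows):
    """Train a nearest-centroid classifier on (feature, label) pairs."""
    sums, counts = {}, {}
    for x, y in rows:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


def tstr_score(synthetic_train, real_test):
    """TSTR: fit on synthetic data only, evaluate on real data only."""
    return accuracy(fit_centroids(synthetic_train), real_test)


# Hypothetical toy data: (feature, label) pairs.
synthetic = [(0.9, 1), (1.1, 1), (1.3, 1), (-1.0, 0), (-0.8, 0), (-1.2, 0)]
real_test = [(1.0, 1), (0.7, 1), (-0.9, 0), (-0.2, 0), (0.2, 0)]

score = tstr_score(synthetic, real_test)
```

A higher TSTR score indicates that the synthetic data carries more of the signal a downstream model needs; comparing it with the same model trained on real data gives a rough sense of the fidelity gap.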
Problem

Research questions and friction points this paper is trying to address.

synthetic tabular data generation
data heterogeneity
logical consistency
rare-event coverage
low-data regimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical synthesis
top-down and bottom-up
tabular data generation
semantic consistency
hybrid framework