🤖 AI Summary
Existing methods for synthetic tabular data generation, whether purely generative models or LLMs, struggle with data heterogeneity, logical consistency, rare-event coverage, and low-data regimes. This work proposes a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structure from stochastic texture: the top-down path constructs structure-driven logical constraints and cross-modal alignment rules, while the bottom-up path uses lightweight tabular generators to learn local statistical patterns from real data. A unified synthesis engine with an iterative feedback loop combines the two paths. On weak multimodal financial benchmarks that pair tabular data with sentiment text, the approach improves train-on-synthetic, test-on-real (TSTR) performance over neural baselines while preserving semantic consistency, suggesting that rule-guided synthesis can combine controllability, semantic coherence, and statistical fidelity.
📝 Abstract
Existing approaches to synthetic tabular data generation rely on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structure from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed; in the bottom-up path, lightweight tabular generators learn local statistical patterns from real data. The two paths are consolidated in a unified synthesis engine with an iterative feedback loop. We evaluate the framework on weak multimodal financial benchmarks that combine tabular and sentiment-text data. Experimental results show that H-TDBU improves train-on-synthetic, test-on-real (TSTR) performance over neural baseline methods while preserving semantic consistency. Our results suggest that hierarchical rule-guided synthesis provides an effective mechanism for combining controllability, semantic coherence, and statistical fidelity in synthetic data generation.
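To make the two-path idea concrete, the following is a minimal sketch of rule-guided synthesis and not the authors' implementation. It assumes the simplest possible stand-ins: the bottom-up generator is a set of independent per-column Gaussians fitted to real rows, the top-down rules are plain Python predicates, and the feedback loop is rejection and resampling. The column names (`low`, `high`, `volume`), the toy data, and the functions `fit_marginals`, `sample_row`, and `synthesize` are all hypothetical.

```python
import random
import statistics

# Toy "real" data (hypothetical); every row satisfies low <= high and volume >= 0.
REAL_ROWS = [
    {"low": 9.8, "high": 10.2, "volume": 1200.0},
    {"low": 10.1, "high": 10.9, "volume": 900.0},
    {"low": 10.4, "high": 10.6, "volume": 1500.0},
    {"low": 9.9, "high": 10.3, "volume": 1100.0},
    {"low": 10.6, "high": 11.2, "volume": 1300.0},
]

# Top-down path (assumed form): logical constraints expressed as row predicates.
RULES = [
    lambda row: row["low"] <= row["high"],
    lambda row: row["volume"] >= 0,
]


def fit_marginals(real_rows):
    """Bottom-up path (assumed form): fit a mean and std dev per numeric column."""
    marginals = {}
    for col in real_rows[0]:
        values = [row[col] for row in real_rows]
        marginals[col] = (statistics.mean(values), statistics.stdev(values))
    return marginals


def sample_row(marginals, rng):
    """Draw one candidate row from the independent per-column Gaussians."""
    return {col: rng.gauss(mu, sigma) for col, (mu, sigma) in marginals.items()}


def synthesize(real_rows, rules, n_rows, max_rounds=50, seed=0):
    """Sample candidates, keep those passing every rule, resample the shortfall."""
    rng = random.Random(seed)
    marginals = fit_marginals(real_rows)
    accepted = []
    for _ in range(max_rounds):
        needed = n_rows - len(accepted)
        if needed == 0:
            break
        candidates = [sample_row(marginals, rng) for _ in range(needed)]
        # Feedback step: rows violating any rule are dropped and redrawn next round.
        accepted.extend(row for row in candidates if all(rule(row) for rule in rules))
    return accepted


synthetic_rows = synthesize(REAL_ROWS, RULES, n_rows=20)
```

Because the columns are sampled independently, some candidates violate `low <= high` even though no real row does. The rules filter those rows out, which is a toy version of separating semantic structure from statistical texture. A fuller system would presumably feed rule violations back into the generator rather than only rejecting samples.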