🤖 AI Summary
Existing Transformer-based approaches to tabular data generation lack domain-specific priors and suffer from poor scalability and low computational efficiency. To address these limitations, we propose a tree-enhanced hybrid architecture coupled with a dual-quantization tokenizer. Our method pioneers the integration of decision trees with Transformers to explicitly capture the non-smoothness and low pairwise correlations inherent in tabular data. The dual-quantization tokenizer jointly optimizes numerical distribution modeling and sequence compression, substantially reducing vocabulary size and sequence length. Key innovations include discretization-aware modeling, non-rotation-invariant constraints, and lightweight sequence encoding. Evaluated on ten benchmark datasets, our approach achieves a 40% improvement in utility over state-of-the-art models, shrinks the model to 1/16 of the baseline size, and significantly enhances generation fidelity, practical usability, privacy preservation, and inference efficiency.
📝 Abstract
Transformers have achieved remarkable success in tabular data generation. However, they lack the domain-specific inductive biases that are critical to preserving the intrinsic characteristics of tabular data, and they scale poorly due to their quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that incorporates a tree-based model: its discreteness and non-rotational invariance retain tabular-specific inductive biases toward non-smooth and potentially low-correlated patterns, thereby enhancing the fidelity and utility of synthetic data. In addition, we devise a dual-quantization tokenizer that captures multimodal continuous distributions and further facilitates the learning of numerical value distributions. Because tabular data has limited dimension-wise semantic meaning and training set size, our tokenizer also reduces the vocabulary size and sequence length, yielding a significant reduction in model size without sacrificing the capability of the transformer model. We evaluate TabTreeFormer on 10 datasets against multiple generative models across various metrics; the results show that TabTreeFormer achieves superior fidelity, utility, privacy, and efficiency. Our best model yields a 40% utility improvement with 1/16 of the baseline model size.
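To make the dual-quantization idea concrete, here is a minimal sketch. The abstract does not specify the exact scheme, so this assumes a coarse quantile stage (which adapts bin edges to a multimodal distribution) followed by a uniform within-bin stage; the function names (`dual_quantize`, `dual_dequantize`) and parameters (`n_bins`, `n_sub`) are hypothetical, not the paper's API. The point is that each value becomes two small tokens, so the vocabulary holds only `n_bins + n_sub` symbols while the effective resolution is `n_bins * n_sub`.

```python
import numpy as np

def dual_quantize(values, n_bins=8, n_sub=4):
    """Hypothetical two-stage tokenizer: a coarse quantile-bin token
    plus a fine uniform within-bin token per numeric value."""
    values = np.asarray(values, dtype=float)
    # Coarse stage: quantile edges follow the (possibly multimodal) shape.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
    coarse = np.clip(np.searchsorted(edges, values, side="right") - 1,
                     0, n_bins - 1)
    # Fine stage: uniform sub-bins inside each quantile bin.
    lo, hi = edges[coarse], edges[coarse + 1]
    span = np.where(hi > lo, hi - lo, 1.0)  # guard degenerate bins
    fine = np.clip(((values - lo) / span * n_sub).astype(int),
                   0, n_sub - 1)
    return coarse, fine, edges

def dual_dequantize(coarse, fine, edges, n_sub=4):
    """Reconstruct approximate values at sub-bin midpoints."""
    lo, hi = edges[coarse], edges[coarse + 1]
    return lo + (hi - lo) * (fine + 0.5) / n_sub
```

In a sequence model, the two token streams can be interleaved, which is one way a tokenizer could trade a large per-value vocabulary for a short two-token code per numeric cell.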