AI Summary
To address the challenges of modeling high-frequency, sharp signals in tabular data, the poor generalization of deep neural networks in low-label regimes, and the lack of effective augmentation strategies for self-supervised learning, this paper proposes a neural-tree hybrid autoencoder framework. It tightly couples a deep autoencoder with an oblivious soft decision tree, employing a dual-encoder architecture and sample-adaptive gating to generate model-driven, complementary input views without explicit data augmentation. Joint training is achieved via a cross-reconstruction loss and a shared decoder, while spectral analysis reveals complementary inductive biases between the two components. Evaluated on multiple low-label tabular classification and regression tasks, the method consistently outperforms state-of-the-art deep models and supervised tree-based baselines, demonstrating superior representation learning and generalization.
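The sample-adaptive gating mentioned above can be illustrated with a minimal NumPy sketch. Everything here is an illustrative assumption rather than the paper's implementation: the function name `stochastic_gates`, the linear gating network (`Wg`, `bg`), and the Gaussian-noise-plus-clip relaxation are stand-ins for whatever parameterization the authors actually use.

```python
import numpy as np

def stochastic_gates(x, Wg, bg, sigma=0.5, rng=None):
    # Sample-specific stochastic gating (sketch): a tiny linear "gating
    # network" maps each sample to per-feature gate means; Gaussian noise
    # plus a hard clip to [0, 1] yields relaxed Bernoulli-style gates.
    rng = rng or np.random.default_rng()
    mu = x @ Wg + bg                       # (n, d): per-sample gate means
    eps = rng.normal(scale=sigma, size=mu.shape)
    gates = np.clip(mu + eps + 0.5, 0.0, 1.0)
    return gates * x                       # gated "view" of the input
```

Because the gate means depend on `x`, each sample receives its own soft feature mask; giving each encoder its own gating network of this kind is one way to produce the model-specific input views the summary describes.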
Abstract
Deep neural networks often underperform on tabular data due to their sensitivity to irrelevant features and a spectral bias toward smooth, low-frequency functions. These limitations hinder their ability to capture the sharp, high-frequency signals that often define tabular structure, especially when labeled samples are scarce. While self-supervised learning (SSL) offers promise in such settings, it remains challenging in tabular domains due to the lack of effective data augmentations. We propose a hybrid autoencoder that combines a neural encoder with an oblivious soft decision tree (OSDT) encoder, each guided by its own stochastic gating network that performs sample-specific feature selection. Together, these structurally different encoders and model-specific gating networks implement model-based augmentation, producing complementary input views tailored to each architecture. The two encoders, trained with a shared decoder and a cross-reconstruction loss, learn distinct yet aligned representations that reflect their respective inductive biases. During training, the OSDT encoder (robust to noise and effective at modeling localized, high-frequency structure) guides the neural encoder toward representations better suited to tabular data. At inference, only the neural encoder is used, preserving flexibility and SSL compatibility. Spectral analysis highlights the distinct inductive biases of each encoder. Our method achieves consistent gains in low-label classification and regression across diverse tabular datasets, outperforming deep and tree-based supervised baselines.
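The dual-encoder/shared-decoder training signal can be sketched as follows. This is a hedged toy illustration, not the paper's method: the single-layer encoders, the sigmoid-split "oblivious tree" stand-in, and the assumed form of the cross-reconstruction loss (each encoder's latent reconstructed by the same decoder) are all simplifying assumptions.

```python
import numpy as np

def encoder_nn(x, W):
    # Neural-style encoder: one tanh layer (illustrative stand-in).
    return np.tanh(x @ W)

def encoder_tree(x, thresholds, leaf_values, temperature=1.0):
    # Oblivious soft-tree sketch: each feature is compared to a shared
    # threshold via a sigmoid (soft split); the resulting gates weight
    # a set of leaf embeddings to form the latent code.
    gates = 1.0 / (1.0 + np.exp(-(x - thresholds) / temperature))  # (n, d)
    return gates @ leaf_values                                     # (n, k)

def decoder(z, V):
    # Shared linear decoder maps either latent back to input space.
    return z @ V

def cross_reconstruction_loss(x, W, thresholds, leaf_values, V):
    # Both encoders feed the SAME decoder and must each reconstruct x,
    # which pushes their latent spaces into alignment (assumed loss form).
    rec_nn = decoder(encoder_nn(x, W), V)
    rec_tr = decoder(encoder_tree(x, thresholds, leaf_values), V)
    return np.mean((rec_nn - x) ** 2) + np.mean((rec_tr - x) ** 2)
```

Under this loss, gradients from the tree branch flow through the shared decoder, which is one plausible mechanism for the abstract's claim that the OSDT encoder guides the neural encoder during training while only the neural encoder is kept at inference.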