MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification

📅 2026-05-11
📈 Citations: 0
Influential: 0
📄 PDF

career value

200K/year
🤖 AI Summary
Industrial-scale tabular data often confronts challenges such as high dimensionality, extensive missing values, and scarce annotations, with a notable absence of a general-purpose self-supervised pretraining framework. This work proposes MaskTab, a unified pretraining approach that employs learnable missing tokens to distinguish between structural and random missingness, integrates a dual-path hybrid supervision architecture to jointly optimize masked reconstruction and downstream task objectives, and incorporates a Mixture-of-Experts (MoE)-enhanced loss with knowledge distillation. Evaluated on industrial benchmarks, MaskTab achieves substantial performance gains (AUC +5.04%, KS +8.28%). Moreover, the distilled lightweight model retains strong performance under stringent latency and interpretability constraints (AUC +2.55%, KS +4.85%) while demonstrating enhanced robustness to distribution shifts.
📝 Abstract
Tabular data forms the backbone of high-stakes decision systems in finance, healthcare, and beyond. Yet industrial tabular datasets are inherently difficult: high-dimensional, riddled with missing entries, and rarely labeled at scale. While foundation models have revolutionized vision and language, tabular learning still leans on handcrafted features and lacks a general self-supervised framework. We present MaskTab, a unified pre-training framework designed specifically for industrial-scale tabular data. MaskTab encodes missing values via dedicated learnable tokens, enabling the model to distinguish structural absence from random dropout. It jointly optimizes a hybrid supervised pre-training scheme--utilizing a twin-path architecture to reconcile masked reconstruction with task-specific supervision--and an MoE-augmented loss that adaptively routes features through specialized subnetworks. On industrial-scale benchmarks, it achieves +5.04% AUC and +8.28% KS over prior art under rigorous scaling. Moreover, its representations distill effectively into lightweight models, yielding +2.55% AUC and +4.85% KS under strict latency and interpretability constraints, while improving robustness to distribution shifts. Our work demonstrates that tabular data admits a foundation-model treatment--when its structural idiosyncrasies are respected.
Problem

Research questions and friction points this paper is trying to address.

tabular data
missing values
self-supervised learning
industrial classification
foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Tabular Pretraining
Mixture of Experts (MoE)
Missing Value Tokenization
Hybrid Supervised Pretraining
Knowledge Distillation
🔎 Similar Papers
No similar papers found.