MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

career value

165K/year

🤖 AI Summary

Industrial-scale tabular data often confronts challenges such as high dimensionality, extensive missing values, and scarce annotations, with a notable absence of a general-purpose self-supervised pretraining framework. This work proposes MaskTab, a unified pretraining approach that employs learnable missing tokens to distinguish between structural and random missingness, integrates a dual-path hybrid supervision architecture to jointly optimize masked reconstruction and downstream task objectives, and incorporates a Mixture-of-Experts (MoE)-enhanced loss with knowledge distillation. Evaluated on industrial benchmarks, MaskTab achieves substantial performance gains (AUC +5.04%, KS +8.28%). Moreover, the distilled lightweight model retains strong performance under stringent latency and interpretability constraints (AUC +2.55%, KS +4.85%) while demonstrating enhanced robustness to distribution shifts.

📝 Abstract

Tabular data forms the backbone of high-stakes decision systems in finance, healthcare, and beyond. Yet industrial tabular datasets are inherently difficult: high-dimensional, riddled with missing entries, and rarely labeled at scale. While foundation models have revolutionized vision and language, tabular learning still leans on handcrafted features and lacks a general self-supervised framework. We present MaskTab, a unified pre-training framework designed specifically for industrial-scale tabular data. MaskTab encodes missing values via dedicated learnable tokens, enabling the model to distinguish structural absence from random dropout. It jointly optimizes a hybrid supervised pre-training scheme--utilizing a twin-path architecture to reconcile masked reconstruction with task-specific supervision--and an MoE-augmented loss that adaptively routes features through specialized subnetworks. On industrial-scale benchmarks, it achieves +5.04% AUC and +8.28% KS over prior art under rigorous scaling. Moreover, its representations distill effectively into lightweight models, yielding +2.55% AUC and +4.85% KS under strict latency and interpretability constraints, while improving robustness to distribution shifts. Our work demonstrates that tabular data admits a foundation-model treatment--when its structural idiosyncrasies are respected.

Problem

Research questions and friction points this paper is trying to address.

tabular data

missing values

self-supervised learning

industrial classification

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Tabular Pretraining

Mixture of Experts (MoE)

Missing Value Tokenization