To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation

📅 2024-12-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Uniform random masking in tabular data imputation distorts the true missingness distribution. To address this, the paper proposes a proportionally masked autoencoder framework that preserves the observed missingness ratio. Methodologically, it introduces (1) a proportional masking strategy that generates masks from the original data's missing rate, retaining the underlying missing patterns; (2) an MLP-based token mixing mechanism, designed as an efficient, expressive alternative to attention for tabular modeling; and (3) a lightweight masked autoencoder architecture supporting heterogeneous data types and diverse missingness mechanisms (e.g., MCAR, MAR). Experiments across multiple benchmark datasets show improved imputation accuracy, particularly under non-random missingness scenarios. The implementation is publicly available.
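To make the token-mixing idea concrete, here is a minimal NumPy sketch of an MLP that mixes information across feature tokens, in the style of MLP-Mixer. It is an illustration only: the function names, weight shapes, the GELU activation, and the omission of layer normalization are assumptions, not details confirmed by this summary.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def token_mixing(tokens, W1, b1, W2, b2):
    """Mix information across the token (column) axis with a two-layer MLP.

    tokens: (batch, n_tokens, dim) array, one embedding per table column.
    W1: (n_tokens, hidden), W2: (hidden, n_tokens). Hypothetical shapes.
    """
    y = np.swapaxes(tokens, 1, 2)        # (batch, dim, n_tokens)
    h = gelu(y @ W1 + b1)                # (batch, dim, hidden)
    y = h @ W2 + b2                      # (batch, dim, n_tokens)
    return tokens + np.swapaxes(y, 1, 2) # residual connection

rng = np.random.default_rng(0)
batch, n_tokens, dim, hidden = 4, 6, 8, 16
tokens = rng.normal(size=(batch, n_tokens, dim))
W1 = rng.normal(scale=0.1, size=(n_tokens, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, n_tokens))
b2 = np.zeros(n_tokens)
mixed = token_mixing(tokens, W1, b1, W2, b2)
```

Unlike self-attention, the mixing weights here are fixed per column position rather than computed from the input, which avoids the pairwise attention-score computation.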

📝 Abstract

Masked autoencoders (MAEs) have recently demonstrated effectiveness in tabular data imputation. However, due to the inherent heterogeneity of tabular data, the uniform random masking strategy commonly used in MAEs can disrupt the distribution of missingness, leading to suboptimal performance. To address this, we propose a proportional masking strategy for MAEs. Specifically, we first compute the statistics of missingness based on the observed proportions in the dataset, and then generate masks that align with these statistics, ensuring that the distribution of missingness is preserved after masking. Furthermore, we argue that simple MLP-based token mixing offers competitive or often superior performance compared to attention mechanisms while being more computationally efficient, especially in the tabular domain given its inherent heterogeneity. Experimental results validate the effectiveness of the proposed proportional masking strategy across various missing data patterns in tabular datasets. Code is available at: https://github.com/normal-kim/PMAE.
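The masking step described in the abstract can be sketched as follows. This is a simplified reading, not the authors' exact rule: it assumes the "statistics of missingness" are per-column missing rates and that each observed entry is hidden with a probability equal to its column's rate. The function names and the `min_rate`/`max_rate` clipping bounds are hypothetical.

```python
import numpy as np

def column_missing_rates(X):
    """Fraction of NaN entries in each column of a 2-D float array."""
    return np.isnan(X).mean(axis=0)

def proportional_mask(X, rng, min_rate=0.05, max_rate=0.95):
    """Boolean array that is True where an observed entry is hidden for training.

    Assumption: the mask probability of a column equals its observed
    missing rate, clipped so that no column is never or always masked.
    """
    observed = ~np.isnan(X)
    rates = np.clip(column_missing_rates(X), min_rate, max_rate)
    draws = rng.random(X.shape)
    return observed & (draws < rates)  # rates broadcast across rows

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
true_rates = np.array([0.1, 0.3, 0.6])        # synthetic per-column missingness
X[rng.random(X.shape) < true_rates] = np.nan  # inject missing values at random

mask = proportional_mask(X, rng)
masked_frac = mask.sum(axis=0) / (~np.isnan(X)).sum(axis=0)
```

By contrast, a uniform random mask would use a single rate for every column, regardless of how often that column is actually missing in the data.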

Problem

Research questions and friction points this paper is trying to address.

Data Imputation

Mixed Data Types

Missing Patterns

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proportional Masking Autoencoder

Tabular Data Imputation

Efficient Computation

🔎 Similar Papers

Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets