CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing imputation methods for tabular data often ignore both the structure of missingness patterns and the semantic context of the fields. CACTI is a masked autoencoder framework that models both. Its core contribution is median truncated copy masking, a training strategy that encourages the model to learn from the empirical missingness patterns in the data, combined with semantic priors derived from column names and text descriptions. Together these give the model two sources of inductive bias. CACTI integrates text embeddings, a copy-masking mechanism, and a missingness-aware training objective. Evaluated under MCAR, MAR, and MNAR missingness mechanisms, CACTI achieves an average R² gain of 7.8% over the next best method: 13.4% under MNAR, 6.1% under MAR, and 5.3% under MCAR, so the largest gain comes in the non-ignorable (MNAR) setting.

📝 Abstract

We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features - captured by column names and text descriptions - to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods - an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, at random and completely at random, respectively) - across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
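
The abstract names median truncated copy masking without spelling out the procedure, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "copy masking" means each row in a batch borrows the observed/missing pattern of another row, and that "median truncation" means capping the number of visible entries per row at the batch median. The function names `copy_mask` and `median_truncate` are hypothetical.

```python
import numpy as np

def copy_mask(observed, rng):
    """observed: bool array (n_rows, n_cols), True where a value is present.

    Assumption: each row borrows the missingness pattern of a randomly
    chosen donor row, so the training masks follow patterns that actually
    occur in the data. Returns True where the model still sees the value.
    """
    donors = rng.permutation(observed.shape[0])  # a row may draw itself
    borrowed = observed[donors]
    return observed & borrowed  # hide entries that are missing in the donor

def median_truncate(visible, rng):
    """Assumption: cap visible entries per row at the batch median count."""
    counts = visible.sum(axis=1)
    cap = int(np.median(counts))
    out = visible.copy()
    for i in np.flatnonzero(counts > cap):
        cols = np.flatnonzero(visible[i])
        drop = rng.choice(cols, size=int(counts[i] - cap), replace=False)
        out[i, drop] = False  # randomly hide the surplus entries
    return out

rng = np.random.default_rng(0)
observed = rng.random((8, 6)) > 0.3          # toy batch with missing values
copied = copy_mask(observed, rng)
visible = median_truncate(copied, rng)
targets = observed & ~visible                # observed but hidden: reconstruct these
```

In this sketch the reconstruction loss would be computed only on `targets`, since those are the entries whose true values are known but were withheld from the model.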

Problem

Research questions and friction points this paper is trying to address.

Improving tabular data imputation using masking and context

Learning from missingness patterns and feature relationships

Enhancing imputation accuracy across diverse missing data scenarios
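
The "diverse missing data scenarios" are the three standard mechanisms named in the summary: MCAR, MAR, and MNAR. As a rough illustration of how they differ, and not the paper's benchmark protocol, the sketch below simulates each on a toy matrix. The function names, the logistic link, and the choice of driver and target columns are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcar_mask(x, rate, rng):
    """MCAR: every entry is dropped with the same probability."""
    return rng.random(x.shape) < rate

def mar_mask(x, rng, driver=0, target=1):
    """MAR: missingness in `target` depends on the always-observed `driver`."""
    p = sigmoid(x[:, driver] - x[:, driver].mean())
    mask = np.zeros(x.shape, dtype=bool)
    mask[:, target] = rng.random(x.shape[0]) < p
    return mask

def mnar_mask(x, rng, target=1):
    """MNAR: missingness in `target` depends on the value being hidden."""
    p = sigmoid(x[:, target] - x[:, target].mean())
    mask = np.zeros(x.shape, dtype=bool)
    mask[:, target] = rng.random(x.shape[0]) < p
    return mask

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 3))               # toy data, True in a mask = missing
masks = {
    "MCAR": mcar_mask(x, 0.3, rng),
    "MAR": mar_mask(x, rng),
    "MNAR": mnar_mask(x, rng),
}
```

Under MNAR the hidden values are systematically larger than the observed ones in this toy setup, which is why the observed data alone cannot reveal the missingness mechanism.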

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked autoencoding for tabular data imputation

Median truncated copy masking training strategy

Semantic relationships between features, captured by column names and text descriptions

🔎 Similar Papers

Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets