A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models

📅 2024-05-28

🏛️ International Conference on Computational Linguistics

📈 Citations: 2

✨ Influential: 1

🤖 AI Summary

This work addresses missing value imputation in tabular data under three canonical missingness mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). It proposes CRILM, a method that uses pre-trained language models for missing data imputation. Instead of filling gaps with numerical estimates, CRILM prompts a language model to generate contextually relevant textual descriptors for the missing values, which helps mitigate the bias that arises in MNAR settings, where missingness depends on the unobserved values themselves. A large LM generates the descriptors and a small LM is then fine-tuned on the enriched dataset, which keeps computational cost low while improving downstream task performance. In the reported evaluations, CRILM outperforms the best-performing baselines across all three missingness mechanisms, with up to a 10% improvement, and is presented as a cost-effective option for resource-constrained environments.
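
To make the three mechanisms concrete, here is a minimal sketch of how each kind of missingness could be simulated on a toy table. It is not taken from the paper: the columns `age` and `income`, the median cut-off rule, the rates, and all function names are assumptions chosen for illustration.

```python
import random

def median(xs):
    """Upper median of a list of numbers."""
    s = sorted(xs)
    return s[len(s) // 2]

def mcar_mask(target, rate, rng):
    """MCAR: every entry is dropped with the same probability, regardless of any data."""
    return [rng.random() < rate for _ in target]

def mar_mask(observed_col, rate, rng):
    """MAR: the drop probability depends only on another, fully observed column."""
    cut = median(observed_col)
    return [o > cut and rng.random() < rate for o in observed_col]

def mnar_mask(target, rate, rng):
    """MNAR: the drop probability depends on the (soon to be hidden) value itself."""
    cut = median(target)
    return [t > cut and rng.random() < rate for t in target]

# Toy data (assumed for illustration only).
rng = random.Random(0)
age = [rng.randint(20, 70) for _ in range(1000)]
income = [rng.gauss(50, 15) for _ in range(1000)]

mask_mcar = mcar_mask(income, 0.3, rng)   # income missing at random positions
mask_mar = mar_mask(age, 0.6, rng)        # income missing mostly for older rows
mask_mnar = mnar_mask(income, 0.6, rng)   # income missing mostly when it is high

# Under MNAR the surviving values are a biased sample of the true column.
observed = [v for v, m in zip(income, mask_mnar) if not m]
true_mean = sum(income) / len(income)
observed_mean = sum(observed) / len(observed)
print(f"true mean {true_mean:.1f}, observed mean under MNAR {observed_mean:.1f}")
```

Because the MNAR mask removes mostly high incomes, the mean of the surviving values falls below the true mean, which is the kind of bias a purely numerical imputer fitted on observed values can inherit.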

📝 Abstract

This paper presents a novel approach named Contextually Relevant Imputation leveraging pre-trained Language Models (CRILM) for handling missing data in tabular datasets. Instead of relying on traditional numerical estimations, CRILM uses pre-trained language models (LMs) to create contextually relevant descriptors for missing values. This method aligns datasets with LMs' strengths, allowing large LMs to generate these descriptors and small LMs to be fine-tuned on the enriched datasets for enhanced downstream task performance. Our evaluations demonstrate CRILM's superior performance and robustness across MCAR, MAR, and challenging MNAR scenarios, with up to a 10% improvement over the best-performing baselines. By mitigating biases, particularly in MNAR settings, CRILM improves downstream task performance and offers a cost-effective solution for resource-constrained environments.
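
The idea of replacing a missing cell with a textual descriptor, rather than a number, can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: `stub_descriptor` is a hypothetical stand-in for the large-LM call, and the `column: value` serialization format and the example row are invented for the sketch.

```python
from typing import Callable, Mapping, Optional, Union

Cell = Optional[Union[int, float, str]]

def stub_descriptor(column: str) -> str:
    """Hypothetical stand-in for a large-LM call that would return a
    contextually relevant phrase for a missing value in `column`."""
    return f"{column} not recorded"

def row_to_text(row: Mapping[str, Cell],
                describe: Callable[[str], str] = stub_descriptor) -> str:
    """Serialize one table row to text, substituting a descriptor for each missing cell."""
    parts = []
    for column, value in row.items():
        if value is None:
            # Missing cell: insert a textual descriptor instead of a numeric estimate.
            parts.append(f"{column}: {describe(column)}")
        else:
            parts.append(f"{column}: {value}")
    return "; ".join(parts)

# Example row with one missing value (assumed data).
row = {"age": 42, "income": None, "occupation": "nurse"}
text = row_to_text(row)
print(text)  # age: 42; income: income not recorded; occupation: nurse
```

Text produced this way, paired with the row's label, is the kind of enriched example a small LM could then be fine-tuned on for the downstream task.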

Problem

Research questions and friction points this paper is trying to address.

Handling missing data in tabular datasets using context-aware imputation

Leveraging pre-trained language models for generating contextually relevant descriptors

Improving downstream task performance and mitigating biases in missing data scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pre-trained language models to handle missing data in tabular datasets

Generates contextually relevant descriptors for missing values with large LMs

Fine-tunes small LMs on enriched datasets
