Learning from Anonymized and Incomplete Tabular Data

📅 2026-02-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a challenge posed by user-driven privacy mechanisms for tabular data: the resulting datasets are heterogeneously anonymized, mixing original values, generalized values, and missing entries within the same records and attributes. Conventional machine learning pipelines typically treat non-original values as new categories or as missing, which discards generalization semantics. The paper proposes data transformation strategies that account for this heterogeneous anonymization and evaluates them alongside standard imputation and LLM-based approaches. Experiments across multiple datasets, privacy configurations, and deployment scenarios indicate that the approach reliably regains utility, that generalized values are preferable to pure suppression, and that the best data preparation strategy depends on the scenario. The results underscore that consistent data representations and appropriate handling of anonymized values are crucial for downstream utility.

📝 Abstract

User-driven privacy allows individuals to control whether and at what granularity their data is shared, leading to datasets that mix original, generalized, and missing values within the same records and attributes. While such representations are intuitive for privacy, they pose challenges for machine learning, which typically treats non-original values as new categories or as missing, thereby discarding generalization semantics. For learning from such tabular data, we propose novel data transformation strategies that account for heterogeneous anonymization and evaluate them alongside standard imputation and LLM-based approaches. We employ multiple datasets, privacy configurations, and deployment scenarios, demonstrating that our method reliably regains utility. Our results show that generalized values are preferable to pure suppression, that the best data preparation strategy depends on the scenario, and that consistent data representations are crucial for maintaining downstream utility. Overall, our findings highlight that effective learning is tied to the appropriate handling of anonymized values.
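To make the idea of a consistent representation concrete, the following is a minimal sketch and not the paper's actual transformation. It assumes a numeric "age" attribute whose cells may hold an original value ("34"), a generalized range ("30-39"), or a suppressed value ("*" or missing), and it maps all three onto the same (low, high) interval form. The function name `encode_age`, the `AGE_DOMAIN` bounds, and the string formats are hypothetical choices made only for illustration.

```python
from typing import Optional, Tuple

# Hypothetical domain bounds for the "age" attribute (an assumption of this sketch).
AGE_DOMAIN: Tuple[float, float] = (0.0, 120.0)


def encode_age(
    value: Optional[str],
    domain: Tuple[float, float] = AGE_DOMAIN,
) -> Tuple[float, float]:
    """Map an original, generalized, or suppressed cell to a (low, high) interval.

    Assumes non-negative values, so "-" only ever appears as a range separator.
    """
    if value is None or value.strip() in {"", "*"}:
        # Suppressed cell: all that is known is that the value lies in the domain.
        return domain
    text = value.strip()
    if "-" in text:
        # Generalized cell such as "30-39": keep both bounds.
        low, high = text.split("-", 1)
        return (float(low), float(high))
    # Original cell: a degenerate interval with equal bounds.
    x = float(text)
    return (x, x)


# One column mixing original, generalized, and suppressed cells.
column = ["34", "30-39", "*", None, "52"]
encoded = [encode_age(v) for v in column]
```

Under this encoding, "30-39" stays numerically close to "34" instead of becoming an unrelated category or a missing entry, which is one way the generalization semantics described above could be preserved for a downstream model.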

Problem

Research questions and friction points this paper is trying to address.

anonymized data

incomplete data

tabular data

privacy

machine learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

anonymized tabular data

user-driven privacy

data transformation