🤖 AI Summary
Data minimization (DM) in machine learning (ML) suffers from fragmented terminology, inconsistent evaluation metrics, and divergent technical approaches, hindering practical adoption. To address this, we propose DMML—the first comprehensive, domain-specific framework for DM in ML. Grounded in a systematic literature review, DMML integrates data lifecycle analysis, threat modeling, and classification of privacy-enhancing technologies to define a unified data processing pipeline, formalize minimization nodes, and establish a principled threat model. It is the first framework to coherently unify perspectives from privacy, security, and ML—clarifying conceptual boundaries and synergies among DM-related techniques (e.g., differential privacy, feature selection, synthetic data generation). DMML delivers a reusable methodology and multi-scenario evaluation guidelines. By bridging disciplinary gaps, it provides both theoretical foundations and engineering guidance for the principled design, implementation, and verification of DM in AI/ML systems.
📝 Abstract
Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, our work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization. This framework allows us to systematically review the literature on data minimization and DM-adjacent methodologies, for the first time presenting a structured overview designed to help practitioners and researchers effectively apply DM principles. Our work facilitates a unified DM-centric understanding and broader adoption of data minimization strategies in AI/ML.