🤖 AI Summary
Decision trees are widely deployed for structured data modeling, yet their reliance on privacy-sensitive predictors (such as health records, biometric identifiers, and location trajectories) poses significant compliance risks under stringent regulations, including the EU GDPR and the U.S. HIPAA and CCPA.
Method: We systematically analyzed the variables reported in hundreds of empirical ML papers, mapping each at a fine-grained level to regulatory definitions through legal text analysis, cross-study metadata coding, regulatory terminology standardization, and temporal trend modeling.
Contribution/Results: We introduce the first reusable, multi-jurisdictional regulatory classification framework that enables evidence-based alignment between ML practice and privacy law. Results show over 60% of reported variables fall under high-regulation categories, with health data dominating in medical applications; substantial inter-sector variation exists, and regulatory compliance gaps have widened in recent years. This work establishes an empirical benchmark and governance checklist for privacy-enhancing machine learning.
📝 Abstract
Decision-tree methods are widely used on structured tabular data and valued for their interpretability across many sectors. Because published studies often list the predictors they use (for example, age, diagnosis codes, and location), and privacy laws increasingly regulate such data types, we treat published decision-tree papers as a proxy for real-world use of legally governed data. We compile a corpus of decision-tree studies and assign each reported predictor to a regulated data category (for example, health data, biometric identifiers, children's data, financial attributes, location traces, and government IDs), then link each category to specific excerpts from European Union and United States privacy laws. We find that many reported predictors fall into regulated categories, with the largest shares in healthcare and clear differences across industries. We analyze prevalence, industry composition, and temporal patterns, summarizing regulation-aligned timing by each framework's reference year. Our evidence supports privacy-preserving methods and governance checks, and can inform ML practice beyond decision trees.