🤖 AI Summary
Decision trees are widely deployed for structured data modeling, yet their reliance on privacy-sensitive predictors (such as health records, biometric identifiers, and location trajectories) poses significant compliance risks under stringent regulations, including the EU GDPR and the U.S. HIPAA and CCPA.
Method: We systematically analyzed the variables reported in hundreds of empirical ML papers, mapping each at a fine-grained level to regulatory definitions through legal text analysis, cross-study metadata coding, regulatory terminology standardization, and temporal trend modeling.
Contribution/Results: We introduce the first reusable, multi-jurisdictional regulatory classification framework that enables evidence-based alignment between ML practice and privacy law. Results show over 60% of reported variables fall under high-regulation categories, with health data dominating in medical applications; substantial inter-sector variation exists, and regulatory compliance gaps have widened in recent years. This work establishes an empirical benchmark and governance checklist for privacy-enhancing machine learning.
📝 Abstract
Decision-tree methods are widely used on structured tabular data and valued for their interpretability across many sectors. Because published studies often list the predictors they use (for example, age, diagnosis codes, and location), and privacy laws increasingly regulate such data types, we treat published decision-tree papers as a proxy for real-world use of legally governed data. We compile a corpus of decision-tree studies and assign each reported predictor to a regulated data category (for example, health data, biometric identifiers, children's data, financial attributes, location traces, and government IDs), then link each category to specific excerpts from European Union and United States privacy laws. We find that many reported predictors fall into regulated categories, with the largest shares in healthcare and clear differences across industries. We analyze prevalence, industry composition, and temporal patterns, summarizing regulation-aligned timing by each framework's reference year. Our evidence supports privacy-preserving methods and governance checks, and can inform ML practice beyond decision trees.