🤖 AI Summary
Differentially private random forests often suffer significant utility degradation on sensitive tabular data, which limits their practicality. This work proposes building large random decision trees, pruning them with a privacy-preserving strategy that retains only sufficiently populated nodes, and relying on an $(\varepsilon, \delta)$-differentially private heavy hitter detection algorithm for hierarchical data whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$. This scaling supports deeper trees than prior work, which improves model expressiveness. Empirical evaluations on benchmark datasets show that the proposed method outperforms existing differentially private random forest techniques under practical privacy budgets, achieving a better privacy-utility trade-off and establishing a new state of the art.
📝 Abstract
Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,\delta)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.
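To make the "build a random tree, then prune sparsely populated nodes" idea concrete, here is a minimal sketch of a naive baseline. It is not the Lumberjack algorithm and not the paper's heavy hitter routine. It assumes features scaled to $[0, 1)$, add/remove-one-record neighbouring datasets, $0 < \varepsilon < 1$, and the classical Gaussian mechanism applied to the counts of every node in a complete, data-independent binary tree. All function names and parameters are hypothetical, and class labels are left out, so only the tree structure is pruned.

```python
import math
import random


def gaussian_sigma(epsilon, delta, height):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1).

    One record adds 1 to every node on a root-to-leaf path, i.e. height + 1
    counts, so the L2 sensitivity of the count vector is sqrt(height + 1).
    """
    l2_sensitivity = math.sqrt(height + 1)
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def build_random_splits(height, num_features, rng):
    """Data-independent tree: each internal node gets a random (feature, threshold)."""
    splits = {}
    for depth in range(height):
        for index in range(2 ** depth):
            splits[(depth, index)] = (rng.randrange(num_features), rng.random())
    return splits


def count_nodes(rows, splits, height):
    """Exact per-node counts for a complete binary tree of the given height."""
    counts = {(d, i): 0 for d in range(height + 1) for i in range(2 ** d)}
    for row in rows:
        index = 0
        counts[(0, 0)] += 1
        for depth in range(height):
            feature, threshold = splits[(depth, index)]
            index = 2 * index + (1 if row[feature] >= threshold else 0)
            counts[(depth + 1, index)] += 1
    return counts


def prune(counts, sigma, threshold, rng):
    """Keep a node only if its noisy count passes the threshold and its parent was kept."""
    noisy = {node: c + rng.gauss(0.0, sigma) for node, c in counts.items()}
    kept = set()
    # Sorting (depth, index) tuples visits every parent before its children.
    for (depth, index) in sorted(noisy):
        parent_ok = depth == 0 or (depth - 1, index // 2) in kept
        if parent_ok and noisy[(depth, index)] >= threshold:
            kept.add((depth, index))
    return kept
```

Because the split structure does not depend on the data, the pruning step only post-processes the noisy counts. In this baseline the noise scale grows like $\sqrt{h+1}$ and every one of the $2^{h+1}-1$ nodes is enumerated, which is the kind of cost that an error of $O_{\varepsilon,\delta}(\sqrt{\log h})$ would make far less restrictive for deep trees.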