🤖 AI Summary
This work uncovers a privacy threat in horizontal federated learning (HFL) with tree-based models (e.g., XGBoost, LightGBM): a single malicious client can reconstruct other participants' sensitive training data by exploiting the split values and decision paths of the shared trees. The proposed attack, TimberStrike, is an optimization-based dataset reconstruction attack tailored to the discrete structure of decision trees, and it applies across multiple federated gradient boosting implementations (Flower, NVFlare, FedTree). The study also examines Differential Privacy as a defense and finds a utility–privacy trade-off: it only partially mitigates the attack while significantly degrading model performance. On a public stroke prediction dataset, TimberStrike reconstructs between 73.05% and 95.63% of the target dataset across all evaluated implementations, suggesting that existing defenses do not simultaneously preserve privacy and model utility in this setting.
📝 Abstract
Federated Learning has emerged as a privacy-oriented alternative to centralized Machine Learning, enabling collaborative model training without direct data sharing. While extensively studied for neural networks, the security and privacy implications of tree-based models remain underexplored. This work introduces TimberStrike, an optimization-based dataset reconstruction attack targeting horizontally federated tree-based models. Our attack, carried out by a single client, exploits the discrete nature of decision trees by using split values and decision paths to infer sensitive training data from other clients. We evaluate TimberStrike on state-of-the-art federated gradient boosting implementations across multiple frameworks, including Flower, NVFlare, and FedTree, demonstrating their vulnerability to privacy breaches. On a publicly available stroke prediction dataset, TimberStrike consistently reconstructs between 73.05% and 95.63% of the target dataset across all implementations. We further analyze Differential Privacy, showing that while it partially mitigates the attack, it also significantly degrades model performance. Our findings highlight the need for privacy-preserving mechanisms specifically designed for tree-based Federated Learning systems, and we provide preliminary insights into their design.
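To give intuition for why split values and decision paths leak information, here is a minimal illustrative sketch (not the paper's actual algorithm): each split along a decision path imposes an interval constraint on one feature, so the leaf a training sample reaches confines that sample to a box in feature space. The `path_to_bounds` helper and the left-means-`<`-threshold convention are hypothetical simplifications for illustration.

```python
import math

def path_to_bounds(path, n_features):
    """Turn a decision path into per-feature (low, high) bounds.

    path: list of (feature_index, threshold, went_left) tuples, where
    went_left is assumed to mean feature < threshold (a common but
    implementation-specific convention).
    """
    bounds = [[-math.inf, math.inf] for _ in range(n_features)]
    for feat, thr, went_left in path:
        if went_left:
            # feature < threshold tightens the upper bound
            bounds[feat][1] = min(bounds[feat][1], thr)
        else:
            # feature >= threshold tightens the lower bound
            bounds[feat][0] = max(bounds[feat][0], thr)
    return bounds

# Hypothetical leaf reached via: x0 < 50, x1 >= 2.5, x0 >= 30
path = [(0, 50.0, True), (1, 2.5, False), (0, 30.0, False)]
print(path_to_bounds(path, 2))  # [[30.0, 50.0], [2.5, inf]]
```

Each additional tree in a boosted ensemble intersects further boxes with these, which is why many shared trees can narrow the feasible region around individual training samples.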