Principled Federated Random Forests for Heterogeneous Data

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing federated random forest methods lack theoretical guarantees and struggle to optimize the global impurity criterion under data heterogeneity—such as covariate or label distribution shifts—leading to significant performance degradation. This work proposes FedForest, a federated random forest algorithm tailored for horizontally partitioned heterogeneous data, which approximates the centralized optimal split by aggregating client-side statistics and introduces client-specific indicators to enable non-parametric personalization. FedForest is the first to provide theoretical support for federated random forests, overcoming the limitations of conventional heuristic aggregation strategies. Experimental results demonstrate that FedForest achieves performance close to that of centralized models across diverse heterogeneity benchmarks, while maintaining communication efficiency and substantially outperforming existing approaches.

📝 Abstract
Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
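The abstract states that FedForest's splitting procedure aggregates "carefully chosen client statistics" so that the server's split closely approximates the centralized one. The paper's exact statistics are not given on this page; below is a minimal illustrative sketch of the general idea, assuming each client shares per-threshold class counts over a shared candidate grid (the function names `client_stats` and `server_best_split` are hypothetical, not the paper's API):

```python
import numpy as np

def client_stats(X, y, feature, thresholds, n_classes):
    # Per-client sufficient statistics: class counts on each side of every
    # candidate threshold. Only these aggregates leave the client, never
    # the raw rows.
    left = np.zeros((len(thresholds), n_classes), dtype=int)
    right = np.zeros((len(thresholds), n_classes), dtype=int)
    for t_idx, t in enumerate(thresholds):
        mask = X[:, feature] <= t
        for c in range(n_classes):
            left[t_idx, c] = np.sum(mask & (y == c))
            right[t_idx, c] = np.sum(~mask & (y == c))
    return left, right

def gini(counts):
    # Gini impurity of a node described by its class-count vector.
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    return 1.0 - np.sum(p ** 2)

def server_best_split(stats_list, thresholds):
    # Sum the per-client counts into global counts, then score each
    # threshold by the weighted Gini impurity of the induced partition,
    # exactly as a centralized tree would on the pooled data.
    left = sum(s[0] for s in stats_list)
    right = sum(s[1] for s in stats_list)
    best_t, best_score = None, np.inf
    for t_idx, t in enumerate(thresholds):
        nl, nr = left[t_idx].sum(), right[t_idx].sum()
        n = nl + nr
        score = (nl / n) * gini(left[t_idx]) + (nr / n) * gini(right[t_idx])
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

The sketch works because the global impurity of any candidate split depends only on summed class counts: for a fixed candidate grid, the server recovers the same split a centralized algorithm would choose, which is the flavor of approximation guarantee the abstract claims.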
Problem

Research questions and friction points this paper is trying to address.

Federated Learning
Random Forests
Data Heterogeneity
Horizontal Partitioning
Client Heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning
Random Forests
Data Heterogeneity
Non-parametric Personalization
Communication Efficiency
Rémi Khellaf
Inria PreMeDICaL, Inserm, University of Montpellier, France
Erwan Scornet
Professor, Sorbonne Université
Statistics, Machine Learning
A. Bellet
Inria PreMeDICaL, Inserm, University of Montpellier, France
Julie Josse
Senior Researcher, Inria
Missing values, Low-rank matrices, Causal inference