Principled Federated Random Forests for Heterogeneous Data

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing federated random forest methods lack theoretical guarantees and struggle to optimize the global impurity criterion under data heterogeneity—such as covariate or label distribution shifts—leading to significant performance degradation. This work proposes FedForest, a federated random forest algorithm tailored for horizontally partitioned heterogeneous data, which approximates the centralized optimal split by aggregating client-side statistics and introduces client-specific indicators to enable non-parametric personalization. FedForest is the first to provide theoretical support for federated random forests, overcoming the limitations of conventional heuristic aggregation strategies. Experimental results demonstrate that FedForest achieves performance close to that of centralized models across diverse heterogeneity benchmarks, while maintaining communication efficiency and substantially outperforming existing approaches.

📝 Abstract
Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
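The abstract states that FedForest's splitting procedure aggregates "carefully chosen client statistics" so that the server's split closely approximates the centralized one. The paper's exact statistics are not given on this page; below is a minimal illustrative sketch of the general idea, assuming each client shares per-threshold class counts over a shared candidate grid (the function names `client_stats` and `server_best_split` are hypothetical, not the paper's API):

```python
import numpy as np

def client_stats(X, y, feature, thresholds, n_classes):
    # Per-client sufficient statistics: class counts on each side of every
    # candidate threshold. Only these aggregates leave the client, never
    # the raw rows.
    left = np.zeros((len(thresholds), n_classes), dtype=int)
    right = np.zeros((len(thresholds), n_classes), dtype=int)
    for t_idx, t in enumerate(thresholds):
        mask = X[:, feature] <= t
        for c in range(n_classes):
            left[t_idx, c] = np.sum(mask & (y == c))
            right[t_idx, c] = np.sum(~mask & (y == c))
    return left, right

def gini(counts):
    # Gini impurity of a node described by its class-count vector.
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    return 1.0 - np.sum(p ** 2)

def server_best_split(stats_list, thresholds):
    # Sum the per-client counts into global counts, then score each
    # threshold by the weighted Gini impurity of the induced partition,
    # exactly as a centralized tree would on the pooled data.
    left = sum(s[0] for s in stats_list)
    right = sum(s[1] for s in stats_list)
    best_t, best_score = None, np.inf
    for t_idx, t in enumerate(thresholds):
        nl, nr = left[t_idx].sum(), right[t_idx].sum()
        n = nl + nr
        score = (nl / n) * gini(left[t_idx]) + (nr / n) * gini(right[t_idx])
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

The sketch works because the global impurity of any candidate split depends only on summed class counts: for a fixed candidate grid, the server recovers the same split a centralized algorithm would choose, which is the flavor of approximation guarantee the abstract claims.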
Problem

Research questions and friction points this paper is trying to address.

Federated Learning
Random Forests
Data Heterogeneity
Horizontal Partitioning
Client Heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning
Random Forests
Data Heterogeneity
Non-parametric Personalization
Communication Efficiency
Rémi Khellaf
Inria PreMeDICaL, Inserm, University of Montpellier, France
Erwan Scornet
Professor, Sorbonne Université
Statistics, Machine Learning
A. Bellet
Inria PreMeDICaL, Inserm, University of Montpellier, France
Julie Josse
Senior Researcher, Inria
Missing values, Low-rank matrices, Causal inference