🤖 AI Summary
In federated learning, aggregated local evaluation metrics often diverge from centralized assessment results, leading to misleading performance estimates. This work presents the first systematic analysis of the root causes of this discrepancy and introduces FLAM (Federated Learning Aggregatable Metrics), a general framework for consistent metric aggregation. Through rigorous mathematical derivation, FLAM establishes necessary conditions for metrics to ensure global consistency and devises a distributed evaluation protocol that enables accurate aggregation of diverse performance measures without requiring a global test set. Empirical evaluations across multiple benchmark tasks demonstrate that FLAM precisely reproduces centralized evaluation outcomes, substantially enhancing the reliability and applicability of model assessment in federated settings and overcoming the prevailing limitation of existing approaches to accuracy alone.
📝 Abstract
Performance evaluation is essential for assessing the quality of machine learning (ML) models and guiding deployment decisions. In federated learning (FL), assessing performance is challenging because the data are distributed across participants. Consequently, the coordinator must rely on locally computed evaluation metrics and aggregate them to assess the global model. A key challenge is that common aggregation strategies, such as averaging weighted by each participant's number of local samples, do not always produce the same results as centralized evaluation. Existing definitions of performance evaluation are largely tailored to accuracy and do not generalize to other metrics, which leads to inconsistencies between participant-based and centralized evaluation. Such discrepancies are inconsistent with the FL objective and yield incorrect metric values.
To address this issue, we examine the underlying reasons for these discrepancies and propose FLAM, a performance evaluation method based on aggregatable measures that yields the same results as centralized evaluation without the need for a global test dataset.
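The discrepancy described above can be illustrated with a small sketch. It is not the FLAM protocol itself, whose details the abstract does not give. It assumes, purely for illustration, a binary classification task in which each participant reports its confusion-matrix counts, and it treats those counts as one example of an aggregatable measure. The names `Counts`, `weighted_average`, and `pool`, as well as the two participants' numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts computed locally by one participant (illustrative)."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def n(self) -> int:
        # Number of local test samples.
        return self.tp + self.fp + self.fn + self.tn


def accuracy(c: Counts) -> float:
    return (c.tp + c.tn) / c.n


def f1(c: Counts) -> float:
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def weighted_average(metric, parts: list[Counts]) -> float:
    """Sample-weighted average of locally computed metric values."""
    total = sum(c.n for c in parts)
    return sum(c.n * metric(c) for c in parts) / total


def pool(parts: list[Counts]) -> Counts:
    """Sum the counts; this equals the counts of a centralized evaluation
    over the union of all local test sets."""
    return Counts(
        tp=sum(c.tp for c in parts),
        fp=sum(c.fp for c in parts),
        fn=sum(c.fn for c in parts),
        tn=sum(c.tn for c in parts),
    )


# Two hypothetical participants with differently distributed local data.
participants = [
    Counts(tp=1, fp=0, fn=9, tn=90),
    Counts(tp=40, fp=10, fn=0, tn=50),
]

for name, metric in [("accuracy", accuracy), ("F1", f1)]:
    averaged = weighted_average(metric, participants)
    centralized = metric(pool(participants))
    print(f"{name}: weighted average = {averaged:.3f}, centralized = {centralized:.3f}")
```

With these made-up numbers, the weighted average of local accuracies matches the centralized accuracy, whereas the weighted average of local F1 scores (about 0.54) differs markedly from the centralized F1 score (about 0.81). Computing F1 from the summed counts reproduces the centralized value without any participant sharing raw test data.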