Totoro$^+$: An Adaptive and Scalable Edge Federated Learning System

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor scalability and limited dynamic adaptability of traditional federated learning systems that rely on centralized parameter servers, particularly in large-scale edge environments. The authors propose a fully decentralized peer-to-peer architecture based on a distributed hash table (DHT), which assigns a dedicated logical parameter server to each application and lets any node flexibly take on coordination, aggregation, or training roles. Key innovations include a locality-aware P2P multi-ring topology, a publish/subscribe-based forest abstraction, and a game-theoretic routing model that guarantees an ε-approximate Nash equilibrium. Experiments on a 500-node EC2 cluster show that the system scales gracefully with both application count and cluster size, achieving 1.2–14× faster training, 𝒪(log N)-hop model dissemination and gradient aggregation, and resilience to network dynamics and node churn.

📝 Abstract

Federated Learning (FL) is an emerging distributed machine learning (ML) technique that enables in-situ model training and inference on decentralized edge devices. We propose Totoro$^+$, a novel scalable FL system that enables a massive number of FL applications to run simultaneously on edge networks. The key insight is to use a distributed hash table (DHT)-based peer-to-peer (P2P) model to re-architect the centralized FL system design into a fully decentralized one. In contrast to previous studies, where many FL applications shared one centralized parameter server, Totoro$^+$ assigns a dedicated parameter server to each application. Any edge node can act as any application's coordinator, aggregator, client selector, worker (participant device), or any combination of these, thereby radically improving scalability and adaptivity. Totoro$^+$ introduces three innovations to realize its design: a locality-aware P2P multi-ring structure, a publish/subscribe-based forest abstraction, and a game-theoretic path planning model that guarantees an $\epsilon$-approximate Nash equilibrium. Real-world experiments on 500 Amazon EC2 servers show that Totoro$^+$ scales gracefully with the number of FL applications and the number $N$ of edge nodes, speeds up the total training time by $1.2\times$ to $14.0\times$, achieves $\mathcal{O}(\log N)$ hops for model dissemination and gradient aggregation with millions of nodes, and adapts efficiently to practical edge network conditions and churn.
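
To illustrate how a DHT can give each application its own parameter server without a central directory, here is a minimal sketch using plain consistent hashing on a single ring. It is not the Totoro$^+$ implementation: the `Ring` class, the `ring_id` helper, the SHA-1 hash, and the node and application names are all assumptions made for this example, and the multi-ring structure, locality awareness, and $\mathcal{O}(\log N)$ routing described in the abstract are not modeled.

```python
import bisect
import hashlib


def ring_id(name: str) -> int:
    """Map a name to a position on the ring (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class Ring:
    """Hypothetical single-ring DHT view: every node knows the sorted ID list."""

    def __init__(self, node_names):
        # Sort nodes by their ring position so we can binary-search for successors.
        self._ring = sorted((ring_id(n), n) for n in node_names)
        self._ids = [node_id for node_id, _ in self._ring]

    def owner(self, app_name: str) -> str:
        """Return the node acting as the logical parameter server for an app.

        The owner is the first node clockwise from the app's key, wrapping
        around to the start of the ring if the key is past the last node.
        """
        key = ring_id(app_name)
        idx = bisect.bisect_left(self._ids, key) % len(self._ids)
        return self._ring[idx][1]


if __name__ == "__main__":
    nodes = [f"edge-{i}" for i in range(16)]
    ring = Ring(nodes)
    for app in ("app-A", "app-B", "app-C"):
        print(app, "->", ring.owner(app))
```

Under these assumptions, each application's key lands on some node independently of the others, so parameter-server load is spread across the network. When a node leaves, only the applications it owned move to the next node on the ring, which is one reason DHT-style designs cope well with churn.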

Problem

Research questions and friction points this paper is trying to address.

Federated Learning

Edge Computing

Scalability

Decentralization

Parameter Server

Innovation

Methods, ideas, or system contributions that make the work stand out.

decentralized federated learning

distributed hash table (DHT)

peer-to-peer multi-ring