🤖 AI Summary
Existing collaborative inference methods, such as vertical federated learning (VFL), lack robustness in dynamic edge networks: they rely on centralized architectures and assume zero device failures, yet edge devices perceive only their local environment and fail frequently under environmental disturbances or extreme weather.
Method: The paper formally defines the problem of *robust collaborative inference under dynamic device failures* and proposes MAGS, a framework that combines simulated-fault training via stochastic dropout injection, model replication, and gossip-based multi-path redundant aggregation, relaxing VFL's strong assumptions of centralization and perpetual device availability.
Contribution/Results: Theoretical analysis establishes formal robustness guarantees. Experiments show that MAGS maintains stable inference performance at device failure rates of up to 90%, significantly outperforming state-of-the-art VFL and distributed baselines and validating its robustness across diverse failure scales.
📝 Abstract
When each edge device of a network only perceives a local part of the environment, collaborative inference across multiple devices is often needed to predict global properties of the environment. In safety-critical applications, collaborative inference must be robust to significant network failures caused by environmental disruptions or extreme weather. Existing collaborative learning approaches, such as privacy-focused Vertical Federated Learning (VFL), typically assume a centralized setup or that devices never fail. However, these assumptions make prior approaches susceptible to significant network failures. To address this problem, we first formalize the problem of robust collaborative inference over a dynamic network of devices that could experience significant network faults. Then, we develop a minimalistic yet impactful method called Multiple Aggregation with Gossip Rounds and Simulated Faults (MAGS) that synthesizes simulated faults via dropout, replication, and gossiping to significantly improve robustness over baselines. We also theoretically analyze our proposed approach to explain why each component enhances robustness. Extensive empirical results validate that MAGS is robust across a range of fault rates, including extreme fault rates.
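Two of the components named in the abstract, dropout-style simulated faults and gossip-based aggregation, can be illustrated with a toy sketch. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names `simulate_faults` and `gossip_average` are ours, devices hold scalar local estimates rather than model activations, and MAGS's model replication and multi-path redundancy are omitted.

```python
import random

def simulate_faults(n_devices, fault_rate, rng):
    # Dropout-style simulated faults (illustrative, not the paper's code):
    # each device fails independently with probability fault_rate.
    # True means the device is still alive.
    return [rng.random() >= fault_rate for _ in range(n_devices)]

def gossip_average(values, alive, rounds, rng):
    # Push-pull gossip among surviving devices: each round, every alive
    # device picks a random alive peer and both adopt their pairwise mean.
    # Pairwise averaging preserves the sum over alive devices, so the
    # survivors converge to the mean of their local estimates even though
    # no central aggregator exists.
    vals = list(values)
    alive_ids = [i for i, ok in enumerate(alive) if ok]
    if not alive_ids:
        return vals
    for _ in range(rounds):
        for i in alive_ids:
            j = rng.choice(alive_ids)
            m = (vals[i] + vals[j]) / 2.0
            vals[i] = vals[j] = m
    return vals

# Usage: 10 devices, 30% fault rate; survivors agree on a shared estimate.
rng = random.Random(0)
local_estimates = [float(i) for i in range(10)]
alive = simulate_faults(10, 0.3, rng)
merged = gossip_average(local_estimates, alive, rounds=100, rng=rng)
```

The decentralized averaging step is why no single failed device can block aggregation: any surviving subset still reaches consensus among itself, which is the intuition behind the robustness claims above.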