AI Summary
To address the three key challenges in Split Federated Learning (SFL), namely high communication overhead, heavy on-device computation, and degraded model accuracy under non-IID data, this paper proposes a decoupled, efficient training framework. Our method introduces: (1) a unidirectional block-wise training mechanism with a locally defined loss function, eliminating gradient uploads entirely; (2) a lightweight auxiliary network generation technique that compresses frequent intermediate activation exchanges into a single transmission; and (3) a cross-device activation aggregation strategy to mitigate the impact of data heterogeneity. Extensive experiments demonstrate that, compared to state-of-the-art SFL approaches, our framework achieves up to 13.26% higher accuracy (with a 53.39% reduction in standard deviation), reduces training time by 94.6%, cuts communication overhead by 99.1%, and decreases on-device computational load by 93.13%.
Abstract
A Federated Learning (FL) system collaboratively trains neural networks across devices and a server but is limited by significant on-device computation costs. Split Federated Learning (SFL) systems mitigate this by offloading a block of layers of the network from the device to a server. However, in doing so, SFL introduces large communication overhead due to frequent exchanges of intermediate activations and gradients between devices and the server, and reduces model accuracy for non-IID data. We propose Ampere, a novel collaborative training system that simultaneously minimizes on-device computation and device-server communication while improving model accuracy. Unlike SFL, which optimizes a global loss via iterative end-to-end training, Ampere develops unidirectional inter-block training that sequentially trains the device and server blocks with local losses, eliminating the transfer of gradients. A lightweight auxiliary network generation method decouples training between the device and server, reducing frequent intermediate activation exchanges to a single transfer, which significantly reduces communication overhead. Ampere mitigates the impact of data heterogeneity by consolidating the activations generated by the trained device block to train the server block, in contrast to SFL, which trains on device-specific, non-IID activations. Extensive experiments on multiple CNNs and transformers show that, compared to state-of-the-art SFL baseline systems, Ampere (i) improves model accuracy by up to 13.26% while reducing training time by up to 94.6%, (ii) reduces device-server communication overhead by up to 99.1% and on-device computation by up to 93.13%, and (iii) reduces the standard deviation of accuracy by 53.39% across varying degrees of non-IID data, highlighting superior performance on heterogeneous data.
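The core idea of unidirectional inter-block training can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the names (`LinearBlock`, `train_block_locally`, the auxiliary linear head) and the toy regression task are illustrative assumptions. It shows the key structural property the abstract describes: each block is trained greedily against its own local loss through an auxiliary head, and only forward activations cross the block boundary, never gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearBlock:
    """One trainable block: h = relu(x @ W). Illustrative, not the paper's API."""
    def __init__(self, d_in, d_out):
        self.W = rng.normal(0.0, 0.1, (d_in, d_out))

    def forward(self, x):
        self.x = x
        self.h = np.maximum(x @ self.W, 0.0)
        return self.h

    def backward(self, grad_h, lr=0.1):
        # Gradient stops here: the block updates its own weights and
        # returns nothing upstream (no gradient crosses the block boundary).
        grad_pre = grad_h * (self.h > 0)
        self.W -= lr * self.x.T @ grad_pre

def train_block_locally(block, aux_W, x, y, steps=200, lr=0.1):
    """Train one block with a local squared loss via a linear auxiliary head."""
    for _ in range(steps):
        h = block.forward(x)
        err = (h @ aux_W) - y                  # local loss: 0.5 * ||pred - y||^2
        aux_W -= lr * h.T @ err / len(x)       # update auxiliary head
        block.backward(err @ aux_W.T / len(x), lr)
    return block.forward(x)                    # detached activations for the next block

# Toy regression data standing in for device-side inputs and labels.
x = rng.normal(size=(64, 8))
y = np.tanh(x @ rng.normal(size=(8, 1)))

# Device block trains first, with its own auxiliary head and local loss.
device_block = LinearBlock(8, 16)
acts = train_block_locally(device_block, rng.normal(0.0, 0.1, (16, 1)), x, y)

# Server block then trains on a single transfer of device activations;
# no gradients ever flow back to the device.
server_block = LinearBlock(16, 16)
final = train_block_locally(server_block, rng.normal(0.0, 0.1, (16, 1)), acts, y)
```

Because the device block finishes training before its activations are sent, the frequent per-iteration activation/gradient exchanges of SFL collapse to one upload, which is the communication saving the abstract claims.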