🤖 AI Summary
This work addresses the dual challenges of privacy leakage and excessive communication overhead in federated video action recognition, where gradient exchange can expose sensitive motion patterns and full-model synchronization imposes substantial bandwidth demands. To mitigate these issues, the authors propose FedDP-STECAR, a framework that integrates differential privacy with selective layer fine-tuning. By perturbing and transmitting only a small subset of task-relevant network layers, the method drastically reduces both the leakage surface and communication costs while preserving temporal feature consistency. Evaluated on the UCF-101 dataset with the MViT-B-16x4 architecture, FedDP-STECAR reaches 73.1% accuracy with 48% faster training in federated settings, achieves up to 70.2% higher accuracy than baselines under a strict privacy budget (ε = 0.65) in centralized settings, and cuts communication overhead by over 99%.
📝 Abstract
Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: *model exposure* and *communication overhead*. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks incurs substantial bandwidth costs. To address these issues, we propose *Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition* (*FedDP-STECAR*). Our framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), shrinking the information-leakage surface while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset with the MViT-B-16x4 transformer show that *FedDP-STECAR* achieves up to **70.2% higher accuracy** under strict privacy (ε = 0.65) in centralized settings and **48% faster training** with **73.1% accuracy** in federated setups, enabling scalable, privacy-preserving video action recognition. Code is available at https://github.com/izakariyya/mvit-federated-videodp
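The core mechanism described above, perturbing and transmitting only a selected subset of layers, can be sketched as a per-client update step. This is a minimal illustration, not the paper's implementation: the layer-name prefixes (`head`, `norm`), clip norm, and noise scale are hypothetical placeholders, and the actual DP calibration for ε = 0.65 would follow a formal accountant not shown here.

```python
import random

def dp_selective_update(update, clip_norm=1.0, noise_std=0.5,
                        tuned_layers=("head", "norm")):
    """Keep only the tuned layers, clip each layer's update to an
    L2 norm of clip_norm, then add Gaussian noise (Gaussian mechanism).
    Frozen layers are dropped entirely, so they are never transmitted."""
    out = {}
    for name, vec in update.items():
        if not name.startswith(tuned_layers):
            continue  # untouched backbone layers stay local
        norm = sum(x * x for x in vec) ** 0.5
        scale = min(1.0, clip_norm / (norm + 1e-12))  # L2 clipping factor
        out[name] = [x * scale + random.gauss(0.0, noise_std * clip_norm)
                     for x in vec]
    return out

# Only the selected layers survive, shrinking the payload the client sends.
client_update = {"head.weight": [3.0, 4.0], "backbone.block1.w": [1.0, 2.0]}
sent = dp_selective_update(client_update)
```

The communication saving follows directly: the server aggregates only the keys present in `sent`, so the traffic per round scales with the tuned subset rather than the full MViT parameter count.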