🤖 AI Summary
To address privacy leakage, high bandwidth consumption, and inference latency caused by uploading raw videos for violent content moderation on short-video platforms, this paper proposes an edge-side federated video understanding framework. Methodologically, it introduces the first federated learning paradigm integrating VideoMAE-based self-supervised visual representation learning, LoRA-enabled parameter-efficient fine-tuning, and layered privacy protection—combining differentially private stochastic gradient descent (DP-SGD) with secure aggregation—alongside a lightweight communication protocol. Evaluated on the RWF-2000 dataset with 40 clients, the framework achieves 77.25% accuracy without privacy constraints and maintains 65–66% under strong differential privacy (ε ≈ 2), while reducing communication overhead by 28.3×. By jointly leveraging self-supervised visual representations and federated fine-tuning for video content moderation, the framework improves both efficiency and the privacy–utility trade-off while keeping user data local.
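The layered privacy protection pairs DP-SGD on each client with secure aggregation at the server. A minimal sketch of the DP-SGD side, assuming the standard per-example clipping plus calibrated Gaussian noise recipe (the function name and hyperparameter values below are illustrative, not taken from the paper):

```python
# Hedged sketch of one DP-SGD step: clip each example's gradient to a
# fixed L2 norm, sum, add Gaussian noise scaled to the clip bound, then
# average. Hyperparameters are illustrative, not the paper's settings.
import math
import random

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1):
    """Return the (noisy) parameter update for one minibatch.

    per_example_grads: list of per-example gradient vectors (lists of floats).
    """
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip L2 norm to clip_norm
        clipped.append([x * scale for x in g])
    summed = [sum(col) for col in zip(*clipped)]
    sigma = noise_multiplier * clip_norm              # noise calibrated to the clip bound
    noisy = [s + random.gauss(0.0, sigma) for s in summed]
    n = len(per_example_grads)
    return [-lr * x / n for x in noisy]               # averaged, noised descent step
```

In the federated setting, each client runs such steps locally on its LoRA parameters; secure aggregation then lets the server see only the sum of the already-noised updates, never an individual client's contribution.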
📝 Abstract
The rapid growth of short-form video platforms increases the need for privacy-preserving moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. Experiments on RWF-2000 with 40 clients achieve 77.25% accuracy without privacy protection and 65–66% under strong differential privacy, while reducing communication cost by 28.3× compared to full-model federated learning. The code is available at: https://github.com/zyt-599/FedVideoMAE
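The 28.3× communication saving follows directly from shipping only the LoRA adapters each round instead of the full backbone. A back-of-envelope sketch of the arithmetic, assuming fp32 (4-byte) client updates and the rounded parameter counts quoted above (the paper's exact 28.3× presumably uses unrounded counts):

```python
# Back-of-envelope per-round upload cost. Parameter counts are the
# rounded figures from the abstract (5.5M LoRA vs. 156M backbone);
# 4 bytes/parameter (fp32 updates) is an assumption.
FULL_PARAMS = 156_000_000
LORA_PARAMS = 5_500_000
BYTES_PER_PARAM = 4  # assumed fp32 client updates

full_upload_mb = FULL_PARAMS * BYTES_PER_PARAM / 1e6  # full-model upload per round
lora_upload_mb = LORA_PARAMS * BYTES_PER_PARAM / 1e6  # LoRA-only upload per round
reduction = FULL_PARAMS / LORA_PARAMS                 # ~28.4x with these rounded counts

print(f"full model: {full_upload_mb:.0f} MB, LoRA: {lora_upload_mb:.0f} MB, "
      f"reduction: {reduction:.1f}x")
```

With 40 clients this gap compounds every aggregation round, which is why adapter-only communication dominates the bandwidth savings even before any compression from the lightweight protocol.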