TACTFL: Temporal Contrastive Training for Multi-modal Federated Learning with Similarity-guided Model Aggregation

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the dual challenges of label scarcity and multimodal data heterogeneity in federated learning, this paper proposes the first semi-supervised federated learning framework tailored for multimodal time-series data. Methodologically, it innovatively integrates modality-agnostic temporal contrastive learning with cross-modal representation alignment and introduces a similarity-guided dynamic aggregation strategy based on representation consistency to mitigate client-level semantic drift. Technically, the framework unifies self-supervised pretraining, federated averaging optimization, and modality-adaptive weight aggregation, enabling joint modeling of video, audio, and wearable sensor data. Extensive experiments on benchmarks such as UCF101 demonstrate significant improvements over state-of-the-art methods: using only 10% labeled data, the framework achieves 68.48% top-1 accuracy—outperforming the FedOpt baseline by 33.13 percentage points.

Technology Category

Application Category

📝 Abstract
Real-world federated learning faces two key challenges: limited access to labelled data and the presence of heterogeneous multi-modal inputs. This paper proposes TACTFL, a unified framework for semi-supervised multi-modal federated learning. TACTFL introduces a modality-agnostic temporal contrastive training scheme that conducts representation learning from unlabelled client data by leveraging temporal alignment across modalities. However, as clients perform self-supervised training on heterogeneous data, local models may diverge semantically. To mitigate this, TACTFL incorporates a similarity-guided model aggregation strategy that dynamically weights client models based on their representational consistency, promoting global alignment. Extensive experiments across diverse benchmarks and modalities, including video, audio, and wearable sensors, demonstrate that TACTFL achieves state-of-the-art performance. For instance, on the UCF101 dataset with only 10% labelled data, TACTFL attains 68.48% top-1 accuracy, significantly outperforming the FedOpt baseline of 35.35%. Code will be released upon publication.
Problem

Research questions and friction points this paper is trying to address.

Addressing limited labeled data in multi-modal federated learning systems
Managing semantic divergence from heterogeneous client data training
Improving global model alignment across diverse modalities and clients
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-agnostic temporal contrastive training for representation learning
Similarity-guided model aggregation for dynamic client weighting
Unified semi-supervised framework for multi-modal federated learning
🔎 Similar Papers
No similar papers found.
G
Guanxiong Sun
School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK
Majid Mirmehdi
Majid Mirmehdi
Professor of Computer Vision, FIAPR, FBMVA, University of Bristol
Computer Vision and Pattern Recognition
Z
Zahraa Abdallah
School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK
R
Raul Santos-Rodriguez
School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK
Ian Craddock
Ian Craddock
University of Bristol
CommunicationsElectromagneticsIoTAnrenna8
Telmo de Menezes e Silva Filho
Telmo de Menezes e Silva Filho
Senior Lecturer in Data Science, School of Engineering Maths and Technology, University of Bristol
machine learningdata sciencecomputer visionnatural language processingsymbolic data analysis