🤖 AI Summary
High communication overhead during training and poor adaptability to multimodal tasks hinder semantic communication in dynamic wireless environments. Method: This paper proposes the first self-supervised pretraining and fine-tuning framework for multimodal semantic communication. It jointly models visual and depth modalities, introducing a co-learning mechanism that enforces both modality-invariant and modality-specific representations to achieve disentangled semantic feature extraction. Computationally intensive pretraining is offloaded to edge or cloud servers via self-supervision, drastically reducing on-device training communication. Results: Evaluated on NYU Depth V2, the method significantly reduces training communication cost compared to supervised baselines while maintaining or improving semantic reconstruction fidelity and downstream task performance, achieving, for the first time, effective decoupling of training and inference communication burdens.
📝 Abstract
Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.
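The dual-phase strategy described above can be illustrated with a minimal NumPy sketch. This is not the paper's architecture: the linear "encoders", dimensions, and synthetic RGB/depth features below are all hypothetical stand-ins, real systems would use deep networks, and practical self-supervised objectives include mechanisms (e.g., contrastive negatives or reconstruction terms) to prevent representation collapse. The sketch only shows the two phases: a label-free pretraining step that pulls the modality-invariant embeddings of paired views together, followed by a supervised task head fitted on the learned features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired RGB / depth feature vectors (hypothetical dimensions); a shared
# latent component stands in for common scene semantics across modalities.
n, d_rgb, d_dep, d_lat = 64, 16, 12, 4
shared = rng.normal(size=(n, d_lat))
x_rgb = np.hstack([shared, rng.normal(size=(n, d_rgb - d_lat))])
x_dep = np.hstack([shared, rng.normal(size=(n, d_dep - d_lat))])

# Linear "encoders" for the modality-invariant branch of each modality.
W_rgb = rng.normal(size=(d_rgb, d_lat)) * 0.1
W_dep = rng.normal(size=(d_dep, d_lat)) * 0.1

def align_loss(W_rgb, W_dep):
    diff = x_rgb @ W_rgb - x_dep @ W_dep
    return float((diff ** 2).mean())

# Phase 1: self-supervised pretraining. No labels are needed, so this
# compute-heavy step can run at the edge/cloud without per-task uplink.
lr = 0.01
loss0 = align_loss(W_rgb, W_dep)
for _ in range(200):
    diff = x_rgb @ W_rgb - x_dep @ W_dep      # (n, d_lat) alignment residual
    W_rgb -= lr * 2 * x_rgb.T @ diff / n      # gradient step on RGB encoder
    W_dep += lr * 2 * x_dep.T @ diff / n      # mirrored step on depth encoder
loss1 = align_loss(W_rgb, W_dep)

# Phase 2: supervised fine-tuning. Fit a small task head on the (frozen)
# pretrained features for a synthetic downstream regression target.
y = shared @ rng.normal(size=(d_lat, 1))
feats = np.hstack([x_rgb @ W_rgb, x_dep @ W_dep])
head, *_ = np.linalg.lstsq(feats, y, rcond=None)
task_err = float(((feats @ head - y) ** 2).mean())
```

The point of the split is visible even in this toy: phase 1 optimizes a task-agnostic alignment objective, so only the lightweight phase 2 needs labeled, task-specific data.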