🤖 AI Summary
This work addresses two challenges of multi-modal inference in bandwidth-constrained wireless edge environments: the high communication overhead of distributed training and limited robustness to channel fluctuations and noisy inputs. The authors propose a three-stage communication-aware distributed learning framework: first, devices perform local multi-modal self-supervised pretraining to initialize encoders without device–server interaction; second, distributed fine-tuning with centralized evidential fusion calibrates per-modality uncertainty; and third, an uncertainty-guided feedback mechanism dynamically balances communication efficiency against inference accuracy. By integrating uncertainty awareness into multi-modal edge inference, the method substantially reduces communication rounds and improves accuracy on RGB–depth indoor scene classification, while demonstrating stronger robustness than existing self-supervised and fully supervised approaches under modality dropout or channel perturbations.
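To make the uncertainty-guided feedback of the third stage concrete, here is a minimal sketch assuming the standard subjective-logic parameterization used in evidential deep learning (evidence e = softplus of logits, Dirichlet parameters α = e + 1, uncertainty mass u = K/S for K classes); the function names and the 0.3 threshold are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dirichlet_opinion(logits: torch.Tensor):
    """Map classifier logits to belief masses and an uncertainty mass.

    Subjective-logic view of a Dirichlet classifier: evidence e >= 0,
    concentration alpha = e + 1, strength S = sum(alpha), belief
    b_k = e_k / S, and uncertainty u = K / S, so that sum(b) + u = 1.
    """
    evidence = F.softplus(logits)                 # non-negative evidence
    alpha = evidence + 1.0                        # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)    # S, shape (B, 1)
    belief = evidence / strength                  # per-class belief masses
    uncertainty = logits.shape[-1] / strength     # u = K / S, in (0, 1]
    return belief, uncertainty

def needs_feedback(logits: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Flag samples whose uncertainty mass exceeds the threshold; only
    these would trigger a request for additional features over the uplink."""
    _, u = dirichlet_opinion(logits)
    return u.squeeze(-1) > threshold
```

Only flagged samples incur an extra round trip, which is how such a mechanism trades a small amount of feedback traffic for accuracy on hard inputs.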
📝 Abstract
Semantic communication is emerging as a key enabler for distributed edge intelligence due to its capability to convey task-relevant meaning. However, achieving communication-efficient training and robust inference over wireless links remains challenging. This challenge is further exacerbated for multi-modal edge inference (MMEI) by two factors: 1) prohibitive communication overhead for distributed learning over bandwidth-limited wireless links, due to the *multi-modal* nature of the system; and 2) limited robustness under varying channels and noisy multi-modal inputs. In this paper, we propose a three-stage communication-aware distributed learning framework to improve training and inference efficiency while maintaining robustness over wireless channels. In Stage I, devices perform local multi-modal self-supervised learning to obtain shared and modality-specific encoders without device–server exchange, thereby reducing the communication cost. In Stage II, distributed fine-tuning with centralized evidential fusion calibrates per-modality uncertainty and reliably aggregates features distorted by noise or channel fading. In Stage III, an uncertainty-guided feedback mechanism selectively requests additional features for uncertain samples, optimizing the communication–accuracy tradeoff in the distributed setting. Experiments on RGB–depth indoor scene classification show that the proposed framework attains higher accuracy with far fewer training communication rounds and remains robust to modality degradation or channel variation, outperforming existing self-supervised and fully supervised baselines.
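The abstract does not spell out the Stage II fusion rule. A common choice for combining per-modality Dirichlet opinions is the reduced Dempster combination rule from trusted multi-view classification; the sketch below implements that rule under this assumption, with all names and shapes illustrative rather than taken from the paper.

```python
import torch

def ds_combine(b1: torch.Tensor, u1: torch.Tensor,
               b2: torch.Tensor, u2: torch.Tensor,
               eps: float = 1e-8):
    """Fuse two subjective-logic opinions with the reduced Dempster rule.

    b1, b2: per-class belief masses, shape (B, K); u1, u2: uncertainty
    masses, shape (B, 1); each opinion satisfies sum(b) + u = 1.
    """
    # Conflict C: belief mass the two opinions place on different classes.
    conflict = (b1.sum(-1, keepdim=True) * b2.sum(-1, keepdim=True)
                - (b1 * b2).sum(-1, keepdim=True))
    scale = 1.0 / (1.0 - conflict + eps)
    fused_b = scale * (b1 * b2 + b1 * u2 + b2 * u1)  # agreement + vacuity terms
    fused_u = scale * (u1 * u2)                      # both opinions uncertain
    return fused_b, fused_u
```

Under this rule, a modality degraded by noise or fading yields low evidence and hence a large uncertainty mass, so its beliefs contribute little to the fused opinion; that down-weighting is one plausible reading of how the framework reliably aggregates distorted features.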