🤖 AI Summary
To address the poor robustness of unimodal (RF or visual) sensing and the limited coverage of single-device deployments in complex urban and non-line-of-sight (NLoS) scenarios, this paper proposes an LLM-driven distributed multimodal sensing and semantic communication framework. Methodologically, the authors design an RF-vision cross-modal fusion network (RVFN), an LLM-guided semantic transmission network (LSTN), and a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism, integrated with a two-stage distributed learning protocol that preserves data privacy. Evaluated on a multi-view RF-visual synthetic dataset generated by the Genesis simulation engine, the framework achieves significant improvements in perception accuracy and semantic fidelity, reduces communication bandwidth by 37%, and enhances noise resilience. The core contribution is the first end-to-end, LLM-enabled loop combining cross-modal semantic collaborative sensing with lightweight, efficient semantic communication.
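The summary mentions that TRAM fuses features from multiple devices with an adaptive aggregation attention mechanism. The paper's implementation is not shown here; as a rough, self-contained sketch of the general idea (all names, dimensions, and the fixed query vector are my own assumptions, not the paper's design), attention weights over per-device features could be computed as:

```python
import numpy as np

def aggregate(device_feats, query):
    """Adaptive attention over K device feature vectors.

    device_feats: (K, d) intermediate features uploaded by K sensing devices.
    query: (d,) aggregation query (hypothetical; a real model would learn it).
    Returns the fused (d,) feature and the (K,) attention weights.
    """
    # Scaled dot-product score of each device feature against the query.
    scores = device_feats @ query / np.sqrt(device_feats.shape[1])
    # Softmax over devices -> adaptive, normalized aggregation weights.
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ device_feats, w

d = 8
query = np.ones(d)
# Three toy devices; device 0 aligns best with the query.
feats = np.stack([np.ones(d), -np.ones(d), 0.5 * np.ones(d)])
fused, w = aggregate(feats, query)
```

In this toy setup the device whose feature best matches the query receives the largest weight, which is the qualitative behavior an adaptive aggregation mechanism is meant to provide (e.g. down-weighting occluded or noisy viewpoints).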
📝 Abstract
Traditional single-modal sensing systems, based solely on either radio frequency (RF) or visual data, struggle to cope with the demands of complex and dynamic environments. Furthermore, single-device systems are constrained by limited perspectives and insufficient spatial coverage, which impairs their effectiveness in urban or non-line-of-sight scenarios. To overcome these challenges, we propose a novel large language model (LLM)-driven distributed integrated multimodal sensing and semantic communication (LLM-DiSAC) framework. Specifically, our system consists of multiple collaborative sensing devices equipped with RF and camera modules, working together with an aggregation center to enhance sensing accuracy. First, on the sensing devices, LLM-DiSAC develops an RF-vision fusion network (RVFN), which employs specialized feature extractors for RF and visual data, followed by a cross-attention module for effective multimodal integration. Second, an LLM-based semantic transmission network (LSTN) is proposed to enhance communication efficiency, where the LLM-based decoder leverages known channel parameters, such as transceiver distance and signal-to-noise ratio (SNR), to mitigate semantic distortion. Third, at the aggregation center, a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism is developed to fuse distributed features and enhance sensing accuracy. To preserve data privacy, a two-stage distributed learning strategy is introduced, allowing local model training at the device level and centralized aggregation-model training on intermediate features. Finally, evaluations on a synthetic multi-view RF-visual dataset generated by the Genesis simulation engine show that LLM-DiSAC delivers strong performance in both sensing accuracy and communication efficiency.
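The abstract describes RVFN as extracting per-modality features and fusing them with a cross-attention module. The paper's actual architecture is not reproduced here; the following is only a minimal NumPy sketch of generic cross-attention (token counts, dimensions, and the concatenation-based fusion are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Let one modality's tokens attend over the other's.

    queries: (N_q, d) tokens (here: visual features).
    keys_values: (N_kv, d) tokens (here: RF features).
    Returns (N_q, d) RF-informed features, one per visual token.
    """
    scores = queries @ keys_values.T / np.sqrt(d_k)  # (N_q, N_kv)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ keys_values

d = 16
vis = rng.normal(size=(8, d))  # hypothetical visual feature tokens
rf = rng.normal(size=(4, d))   # hypothetical RF feature tokens
# One plausible fusion: concatenate visual tokens with attended RF context.
fused = np.concatenate([vis, cross_attention(vis, rf, d)], axis=-1)  # (8, 2d)
```

A real implementation would add learned query/key/value projections and multiple heads; this sketch keeps only the attention arithmetic to make the fusion step concrete.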