Large Language Model-Driven Distributed Integrated Multimodal Sensing and Semantic Communications

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor robustness of unimodal (RF or visual) sensing and the limited coverage of single-device deployments in complex urban and non-line-of-sight (NLoS) scenarios, this paper proposes an LLM-driven distributed multimodal sensing and semantic communication framework. Methodologically, the authors design an RF-vision cross-modal fusion network (RVFN), an LLM-guided semantic transmission network (LSTN), and a transformer-based aggregation model with adaptive aggregation attention (TRAM), integrated with a two-stage distributed learning protocol to preserve data privacy. Evaluated on a multi-view RF-visual synthetic dataset generated by the Genesis simulation engine, the framework achieves significant improvements in perception accuracy and semantic fidelity, reduces communication bandwidth by 37%, and improves noise resilience. The core contribution is the first end-to-end, LLM-enabled loop combining cross-modal collaborative sensing with lightweight, efficient semantic communication.
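
The summary names a cross-attention module for RF-visual fusion but gives no implementation details. As a rough illustration only, the PyTorch sketch below shows one common way to fuse two token streams with cross-attention; the class name `CrossModalFusion`, all dimensions, and the choice of visual queries attending to RF keys/values are assumptions, not the paper's actual RVFN design.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative RF-vision fusion via cross-attention.

    Visual tokens act as queries; RF tokens act as keys/values.
    All dimensions and layer choices are assumptions.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, vis_tokens, rf_tokens):
        # Cross-attention: each visual token attends over all RF tokens.
        fused, _ = self.attn(vis_tokens, rf_tokens, rf_tokens)
        x = self.norm1(vis_tokens + fused)   # residual + norm
        return self.norm2(x + self.ffn(x))   # position-wise feed-forward

# Toy usage: 64 visual tokens and 32 RF tokens, feature dim 256.
vis, rf = torch.randn(2, 64, 256), torch.randn(2, 32, 256)
out = CrossModalFusion()(vis, rf)            # -> (2, 64, 256)
```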

📝 Abstract
Traditional single-modal sensing systems, based solely on either radio frequency (RF) or visual data, struggle to cope with the demands of complex and dynamic environments. Furthermore, single-device systems are constrained by limited perspectives and insufficient spatial coverage, which impairs their effectiveness in urban or non-line-of-sight (NLoS) scenarios. To overcome these challenges, we propose a novel large language model (LLM)-driven distributed integrated multimodal sensing and semantic communication (LLM-DiSAC) framework. Specifically, our system consists of multiple collaborative sensing devices equipped with RF and camera modules, working together with an aggregation center to enhance sensing accuracy. First, on the sensing devices, LLM-DiSAC develops an RF-vision fusion network (RVFN), which employs specialized feature extractors for RF and visual data, followed by a cross-attention module for effective multimodal integration. Second, an LLM-based semantic transmission network (LSTN) is proposed to enhance communication efficiency, where the LLM-based decoder leverages known channel parameters, such as transceiver distance and signal-to-noise ratio (SNR), to mitigate semantic distortion. Third, at the aggregation center, a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism is developed to fuse distributed features and enhance sensing accuracy. To preserve data privacy, a two-stage distributed learning strategy is introduced, allowing local model training at the device level and centralized aggregation model training using intermediate features. Finally, evaluations on a synthetic multi-view RF-visual dataset generated by the Genesis simulation engine show that LLM-DiSAC achieves good performance.
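
The abstract states that the LSTN decoder leverages known channel parameters (transceiver distance and SNR) to mitigate semantic distortion, without specifying the mechanism. Below is a minimal sketch of how such conditioning might look, assuming a FiLM-style scale-and-shift injection of the channel state into the received features; the class `ChannelAwareDecoder` and every design choice here are hypothetical, not the paper's method.

```python
import torch
import torch.nn as nn

class ChannelAwareDecoder(nn.Module):
    """Hypothetical channel-conditioned semantic decoder.

    Maps (distance, SNR) to a per-feature scale and shift that
    modulates received semantic features before decoding.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.cond = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2 * feat_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim)
        )

    def forward(self, received, distance, snr_db):
        # Embed channel state: (batch, 2) -> scale gamma and shift beta.
        state = torch.stack([distance, snr_db], dim=-1)
        gamma, beta = self.cond(state).chunk(2, dim=-1)
        return self.decoder(received * (1 + gamma) + beta)

# Toy usage: features received over a 20 dB link at 150 m.
feats = torch.randn(4, 256)
out = ChannelAwareDecoder()(feats,
                            distance=torch.full((4,), 150.0),
                            snr_db=torch.full((4,), 20.0))
```
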
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of single-modal sensing in dynamic environments
Enhancing sensing accuracy with distributed multimodal fusion
Improving communication efficiency via LLM-driven semantic transmission
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven RF-vision fusion network for multimodal sensing
LLM-based semantic transmission to reduce distortion
Transformer-based adaptive aggregation for distributed features (see the sketch after this list)
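
To make the last item concrete: one plausible form of adaptive aggregation attention lets a learned query score each device's transmitted feature vector, so less reliable views receive lower weight in the fused representation. The `AdaptiveAggregator` class and its single-query design are assumptions for illustration, not the paper's TRAM architecture.

```python
import torch
import torch.nn as nn

class AdaptiveAggregator(nn.Module):
    """Hypothetical attention-based fusion of per-device features.

    A learned query attends over one feature vector per device; the
    fused output is the attention-weighted sum across devices.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, device_feats):
        # device_feats: (batch, num_devices, dim), one row per device.
        q = self.query.expand(device_feats.size(0), -1, -1)
        fused, weights = self.attn(q, device_feats, device_feats)
        return fused.squeeze(1), weights     # fused: (batch, dim)

# Toy usage: aggregate features from 5 distributed sensing devices.
feats = torch.randn(2, 5, 256)
fused, w = AdaptiveAggregator()(feats)       # w: per-device weights
```
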
👥 Authors

Yubo Peng
Nanjing University
semantic communications, generative artificial intelligence, deep learning

Luping Xiang
Research professor @ Nanjing University
wireless communication

Bingxin Zhang
Nanjing University
wireless communications

Kun Yang
State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China, and the School of Intelligent Software and Engineering, Nanjing University (Suzhou Campus), Suzhou, China