🤖 AI Summary
To address key bottlenecks in bandwidth-constrained scenarios (limited single-modality perception accuracy, decoupled sensing and communication pipelines, and single-task sensing that cannot meet users' diverse demands), this paper proposes SIMAC, a semantic-driven integrated multimodal sensing and communication framework built on joint source-channel coding. The method rests on three components: (1) a multimodal semantic fusion (MSF) network that extracts semantics from radar signals and images and fuses them via cross-attention into multimodal semantic representations; (2) a large language model (LLM)-based semantic encoder (LSE) that maps communication parameters and multimodal semantics into a unified latent space for channel-adaptive semantic encoding; and (3) a task-oriented sensing semantic decoder (SSD) with per-task decoding heads, trained under a multi-task learning strategy. Simulation results show that SIMAC delivers diverse sensing services with higher accuracy than unimodal and conventional decoupled approaches.
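The cross-attention fusion step can be pictured with a short PyTorch sketch. This is an illustrative reconstruction under assumed token shapes and dimensions, not the paper's released code; `CrossModalFusion` and all sizes here are hypothetical.

```python
# Minimal sketch of cross-attention fusion between radar and image
# features, in the spirit of the MSF network described above. Module
# names, dimensions, and the use of nn.MultiheadAttention are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Radar tokens attend to image tokens, and vice versa.
        self.radar_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_to_radar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # merge both attended streams

    def forward(self, radar: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # radar: (B, Nr, dim) radar semantic tokens; image: (B, Ni, dim) visual tokens
        r_att, _ = self.radar_to_img(query=radar, key=image, value=image)
        i_att, _ = self.img_to_radar(query=image, key=radar, value=radar)
        # Pool each attended stream and concatenate into one multimodal vector.
        fused = torch.cat([r_att.mean(dim=1), i_att.mean(dim=1)], dim=-1)
        return self.proj(fused)  # (B, dim) multimodal semantic representation

fusion = CrossModalFusion()
radar_feats = torch.randn(4, 32, 256)   # e.g. features from a radar extractor
image_feats = torch.randn(4, 196, 256)  # e.g. patch features from an image extractor
semantics = fusion(radar_feats, image_feats)  # -> (4, 256)
```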
📝 Abstract
Traditional single-modality sensing faces limitations in accuracy and capability, and its decoupled implementation alongside communication systems increases latency in bandwidth-constrained environments. Additionally, single-task-oriented sensing systems fail to address users' diverse demands. To overcome these challenges, we propose a semantic-driven integrated multimodal sensing and communication (SIMAC) framework. This framework leverages a joint source-channel coding architecture to achieve simultaneous sensing decoding and transmission of sensing results. Specifically, SIMAC first introduces a multimodal semantic fusion (MSF) network, which employs two dedicated extractors to obtain semantic information from radar signals and images, respectively; MSF then applies cross-attention to fuse these unimodal features into multimodal semantic representations. Second, we present a large language model (LLM)-based semantic encoder (LSE), in which relevant communication parameters and the multimodal semantics are mapped into a unified latent space and fed to the LLM, enabling channel-adaptive semantic encoding. Third, we propose a task-oriented sensing semantic decoder (SSD), in which different decoding heads are designed according to the specific needs of each task; a multi-task learning strategy is simultaneously introduced to train the SIMAC framework and deliver diverse sensing services. Finally, simulation experiments demonstrate that the proposed framework achieves diverse sensing services with higher accuracy.
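To make the encoder/decoder split concrete, here is a minimal PyTorch sketch of the two remaining pieces: a communication parameter (SNR, as an assumed example) embedded into the same latent space as the semantics, with a small transformer standing in for the LLM backbone, and task-specific heads combined by a weighted multi-task loss. All names, head choices, and loss weights are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of channel-adaptive semantic encoding (LSE) and
# task-oriented decoding (SSD). A tiny TransformerEncoder stands in
# for the LLM; every dimension, head, and weight is hypothetical.
import torch
import torch.nn as nn

class ChannelAdaptiveEncoder(nn.Module):
    def __init__(self, dim: int = 256, n_layers: int = 2):
        super().__init__()
        self.channel_embed = nn.Linear(1, dim)  # map scalar SNR into the latent space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)  # LLM stand-in

    def forward(self, semantics: torch.Tensor, snr_db: torch.Tensor) -> torch.Tensor:
        # semantics: (B, dim); snr_db: (B, 1). Prepend the channel token.
        tokens = torch.stack([self.channel_embed(snr_db), semantics], dim=1)
        return self.backbone(tokens).mean(dim=1)  # (B, dim) channel-adaptive code

class TaskHeads(nn.Module):
    def __init__(self, dim: int = 256, n_classes: int = 10, n_keypoints: int = 17):
        super().__init__()
        self.detect = nn.Linear(dim, n_classes)      # e.g. a classification-style task
        self.pose = nn.Linear(dim, 2 * n_keypoints)  # e.g. a regression-style task

    def forward(self, code: torch.Tensor) -> dict:
        return {"detect": self.detect(code), "pose": self.pose(code)}

# Weighted multi-task loss over the decoded heads (weights are illustrative).
def multitask_loss(outs, cls_target, pose_target, w=(1.0, 0.5)):
    return (w[0] * nn.functional.cross_entropy(outs["detect"], cls_target)
            + w[1] * nn.functional.mse_loss(outs["pose"], pose_target))

enc, heads = ChannelAdaptiveEncoder(), TaskHeads()
code = enc(torch.randn(4, 256), torch.full((4, 1), 10.0))  # semantics at 10 dB SNR
outs = heads(code)  # one forward pass serves multiple sensing tasks
```

The design choice this illustrates is that a single channel-aware latent code feeds every task head, so adding a new sensing service means adding a head and a loss term rather than a separate sensing pipeline.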