🤖 AI Summary
Existing multimodal semantic communication systems employ modality-specific encoding, resulting in low spectral efficiency and cross-modal inconsistency. This paper proposes the first unified vision-language semantic communication framework: it leverages pretrained vision-language models (e.g., CLIP) to extract compact, joint semantic representations; these are transmitted over the channel and subsequently used at the receiver to jointly drive a diffusion model for image generation and a decoder-based language model for text generation. By abandoning modality-isolated encoding, the framework enables dual-modality reconstruction from a single semantic feature stream. Experiments demonstrate that, under low signal-to-noise ratios, the system achieves significantly higher semantic fidelity and cross-modal consistency using reduced bandwidth. Compared to unimodal baselines, it attains an average 23.6% improvement across metrics including FID and BLEU, validating the effectiveness and robustness of cross-modal semantic compression and joint reconstruction.
📝 Abstract
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.