Vision Transformer Based Semantic Communications for Next Generation Wireless Networks

πŸ“… 2025-03-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the fundamental trade-off between bandwidth constraints and semantic fidelity in 6G semantic communication for image transmission, this paper proposes the first end-to-end semantic coding framework based on Vision Transformers (ViT). Methodologically, it deeply integrates ViT’s global self-attention mechanism into a joint source-channel coding architecture, enabling attention-driven semantic feature extraction and robust reconstruction, while explicitly modeling realistic wireless fading and additive noise channels. Its key contribution lies in departing from conventional rate-distortion optimization, instead prioritizing semantic similarity as the primary objective. Experimental results demonstrate that the framework achieves a PSNR of 38 dB across diverse channel conditions and significantly outperforms CNN- and GAN-based baselines in semantic similarity metrics. These findings validate ViT’s effectiveness and generalizability for high-fidelity semantic image transmission under stringent bandwidth limitations.
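The summary describes transmitting ViT-extracted features over realistic fading and additive-noise channels. As a minimal illustrative sketch (not the paper's implementation — the function name and the zero-forcing receiver are assumptions), a Rayleigh flat-fading channel with AWGN and receiver-side equalization can be simulated as:

```python
import numpy as np

def wireless_channel(x, snr_db, rng):
    """Pass complex symbols x through y = h*x + n (Rayleigh flat fading + AWGN),
    then apply zero-forcing equalization y/h at the receiver."""
    # Rayleigh fading coefficient per symbol (unit average power)
    h = (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)) / np.sqrt(2)
    # Scale noise power to the requested SNR relative to the signal power
    signal_power = np.mean(np.abs(x) ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    n = np.sqrt(noise_power / 2) * (
        rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)
    )
    y = h * x + n
    return y / h  # zero-forcing equalization (assumes perfect channel estimation)
```

Reconstruction error shrinks as SNR grows, which is the regime in which the summary's robustness claims are evaluated.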

πŸ“ Abstract
In the evolving landscape of 6G networks, semantic communications are poised to revolutionize data transmission by prioritizing the transmission of semantic meaning over raw data accuracy. This paper presents a Vision Transformer (ViT)-based semantic communication framework deliberately designed to achieve high semantic similarity during image transmission while minimizing bandwidth demand. By employing a ViT as the encoder-decoder backbone, the proposed architecture proficiently encodes images into high-level semantic representations at the transmitter and precisely reconstructs them at the receiver, accounting for real-world fading and noise. Building on the attention mechanisms inherent to ViTs, our model outperforms Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) tailored for image generation. The proposed ViT-based architecture achieves a Peak Signal-to-Noise Ratio (PSNR) of 38 dB, exceeding other Deep Learning (DL) approaches in maintaining semantic similarity across different communication environments. These findings establish our ViT-based approach as a significant breakthrough in semantic communications.
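For reference, the 38 dB figure quoted above is the standard peak signal-to-noise ratio between the original and reconstructed images. A minimal implementation for 8-bit images (the helper name is illustrative, not from the paper):

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

At 38 dB, the implied per-pixel mean squared error for 8-bit images is about 255² / 10^3.8 ≈ 10.3.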
Problem

Research questions and friction points this paper is trying to address.

Enhancing image transmission semantic similarity in 6G networks
Reducing bandwidth demand for semantic communication systems
Outperforming CNNs and GANs in noisy wireless environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

ViT-based semantic communication framework
High semantic similarity with minimal bandwidth
Outperforms CNNs and GANs in PSNR