AI Summary
Existing semantic communication systems predominantly rely on coupled autoencoder architectures, which struggle to model the true signal distribution and scale poorly. This paper introduces diffusion autoencoders to wireless semantic communication for the first time, proposing a "semantic latent variable + conditional diffusion (CDiff)" framework: a neural semantic encoder extracts high-level semantic representations at the transmitter, while the receiver reconstructs the signal through a score-based denoising process in signal space, conditioned on the received semantic latent variable. This design decouples encoding from decoding, and the decoder is proved analytically to be a consistent estimator of the ground-truth data. Experiments on CIFAR-10 and MNIST demonstrate significantly superior reconstruction performance over conventional autoencoders and VAEs. The framework is further extended to multi-user scenarios, where channel distortion and semantic prior mismatch are identified as the dominant practical impairments.
Abstract
Semantic communication (SemCom) systems aim to learn the mapping from low-dimensional semantics to the high-dimensional ground truth. Although this is closer to a "domain translation" problem, existing frameworks typically emphasize channel-adaptive neural encoding-decoding schemes without fully exploring the signal distribution. Moreover, such methods have so far employed autoencoder-based architectures in which the encoder is tightly coupled to a matched decoder, causing scalability issues in practice. To address these gaps, diffusion autoencoder models are proposed for wireless SemCom. The goal is to learn a "semantic-to-clean" mapping from the semantic space to the ground-truth probability distribution. A neural encoder at the semantic transmitter extracts the high-level semantics, and a conditional diffusion model (CDiff) at the semantic receiver exploits the source distribution for signal-space denoising. The received semantic latents are incorporated as the conditioning input to "steer" the decoding process towards the semantics intended by the transmitter. It is proved analytically that the proposed decoder model is a consistent estimator of the ground-truth data. Furthermore, extensive simulations on the CIFAR-10 and MNIST datasets are provided along with design insights, highlighting the performance relative to legacy autoencoders and variational autoencoders (VAEs). The simulations are further extended to multi-user SemCom, identifying the dominant factors in a more realistic setup.
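The pipeline described in the abstract (semantic encoder, channel, conditional-diffusion receiver) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the block-mean `semantic_encoder`, the AWGN `awgn_channel`, the linear noise schedule, and especially `toy_noise_predictor`, which stands in for a trained conditional denoiser by using a closed form that holds only for a piecewise-constant toy source. The sampling loop is the standard DDPM-style ancestral update, chosen here for simplicity.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice); returns betas and cumulative alphas."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def semantic_encoder(x, latent_dim=2):
    """Hypothetical transmitter: compress x into block means (a low-dimensional latent)."""
    block = len(x) // latent_dim
    return [sum(x[i * block:(i + 1) * block]) / block for i in range(latent_dim)]


def awgn_channel(z, noise_std, rng):
    """Assumed channel model: additive white Gaussian noise on the latent."""
    return [v + rng.gauss(0.0, noise_std) for v in z]


def toy_noise_predictor(x_t, t, z, alpha_bars):
    """Stand-in for a trained denoiser eps_theta(x_t, t, z).

    Assumes the source is piecewise constant, so the clean signal implied by
    the latent z is simply z upsampled; the noise estimate then has a closed form.
    """
    block = len(x_t) // len(z)
    a_bar = alpha_bars[t]
    return [(x_t[i] - math.sqrt(a_bar) * z[i // block]) / math.sqrt(1.0 - a_bar)
            for i in range(len(x_t))]


def cdiff_decode(z, signal_dim, betas, alpha_bars, rng):
    """Receiver: reverse diffusion from pure noise, conditioned on the received latent z."""
    x = [rng.gauss(0.0, 1.0) for _ in range(signal_dim)]
    for t in reversed(range(len(betas))):
        eps = toy_noise_predictor(x, t, z, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            # Inject fresh noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    betas, alpha_bars = make_schedule()
    source = [1.0] * 4 + [-2.0] * 4            # piecewise-constant toy signal
    received = awgn_channel(semantic_encoder(source), 0.1, rng)
    print(cdiff_decode(received, len(source), betas, alpha_bars, rng))
```

In this toy setting the decoder depends on the encoder only through the latent `z`, which mirrors the decoupling argued for above, and any noise the channel adds to `z` carries straight through to the reconstruction. A trained network would replace `toy_noise_predictor` in a realistic system.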