🤖 AI Summary
This work addresses the cross-modal reconstruction problem of generating images of the physical environment from WiFi channel state information (CSI). We propose an efficient, high-resolution imaging method based on a pretrained latent diffusion model (LDM). Our approach introduces two key innovations: (1) a lightweight neural network that maps raw CSI amplitude features directly into the LDM's latent space, bypassing pixel-level generation and explicit image encoding; and (2) text-conditioned denoising in that latent space, enabling semantically controllable image synthesis. Evaluated on a custom wide-band CSI dataset and a subset of the public MM-Fi dataset, our method outperforms existing baselines at comparable computational complexity: it achieves 3.2× faster inference than pixel-level diffusion models, improves perceptual quality (reducing FID by 27.6%), and supports interpretable, text-driven semantic control. This establishes a novel camera-free paradigm for environmental sensing via WiFi CSI.
📝 Abstract
We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM's denoising diffusion process to this latent representation under text-based guidance, and decode the result with the LDM's pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image-encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras, and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
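The core pipeline described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's implementation: the subcarrier count, layer sizes, and latent shape are assumptions (the latent shape mirrors typical Stable-Diffusion-style LDMs), and random weights stand in for the trained lightweight network. In the full method, the predicted latent would then be denoised with text guidance and decoded by the LDM's pretrained decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SUBCARRIERS = 512          # CSI amplitude features per frame (assumed)
LATENT_SHAPE = (4, 64, 64)   # typical Stable-Diffusion-style latent shape

# Stand-in for the trained weights of the lightweight mapping network.
W1 = rng.standard_normal((N_SUBCARRIERS, 1024)) * 0.01
W2 = rng.standard_normal((1024, int(np.prod(LATENT_SHAPE)))) * 0.01

def csi_to_latent(csi_amplitude: np.ndarray) -> np.ndarray:
    """Map one frame of CSI amplitudes directly into the LDM latent space,
    skipping both pixel-space generation and the LDM's image encoder."""
    h = np.maximum(csi_amplitude @ W1, 0.0)   # ReLU hidden layer
    return (h @ W2).reshape(LATENT_SHAPE)

csi_frame = rng.standard_normal(N_SUBCARRIERS)   # one measured CSI frame
z = csi_to_latent(csi_frame)
# z would next undergo text-guided latent denoising, then decoding.
print(z.shape)   # (4, 64, 64)
```

The key design point this sketch reflects is that the network's output lives in the latent space rather than in pixel space, which is what lets the method reuse the LDM's denoiser and decoder unchanged.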