🤖 AI Summary
This work addresses the challenge of generating fine-grained natural-language descriptions from Wi-Fi channel state information (CSI), which is hindered by the semantic gap between wireless signals and linguistic representations as well as by ambiguity in left/right limb orientation. To overcome these limitations, the authors propose WiFi2Cap, a three-stage framework: first, a vision-language teacher model extracts transferable supervisory signals from synchronized video-text pairs and guides a CSI-based student model to align its features with the teacher's visual space and text embeddings; second, a mirror-consistency loss mitigates directional ambiguity such as mirrored actions and left/right confusion; and third, a prefix-tuned language model generates action descriptions from the CSI embeddings. The work also introduces the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark, and shows that the framework consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, supporting its effectiveness for privacy-preserving semantic sensing.
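The summary does not spell out the form of the mirror-consistency objective, but the idea of penalizing left/right confusion during cross-modal alignment can be sketched as follows. This is a minimal, hypothetical PyTorch sketch: it assumes the CSI student produces an embedding, that a mirrored CSI input should match the embedding of the left/right-swapped caption, and that matching the wrong caption is discouraged by a margin term. The `mirror_csi` helper, tensor layout, and margin value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mirror_csi(csi: torch.Tensor) -> torch.Tensor:
    """Hypothetical mirroring of a CSI tensor (batch, antennas, subcarriers, time).

    Which axis corresponds to a left/right scene flip depends on the antenna
    geometry; flipping the antenna axis here is only an illustrative choice.
    """
    return torch.flip(csi, dims=[1])

def mirror_consistency_loss(csi_encoder, text_encoder, csi,
                            caption_ids, mirrored_caption_ids, margin: float = 0.2):
    """Sketch of one plausible mirror-consistency objective.

    z / z_mirror : embeddings of the original and mirrored CSI
    t / t_mirror : embeddings of the original caption ("raises the left arm")
                   and its left/right-swapped version ("raises the right arm")
    """
    z = F.normalize(csi_encoder(csi), dim=-1)
    z_mirror = F.normalize(csi_encoder(mirror_csi(csi)), dim=-1)
    t = F.normalize(text_encoder(caption_ids), dim=-1)
    t_mirror = F.normalize(text_encoder(mirrored_caption_ids), dim=-1)

    # Pull matching CSI/caption pairs together.
    align = (1 - (z * t).sum(-1)).mean() + (1 - (z_mirror * t_mirror).sum(-1)).mean()
    # Margin term: CSI must not match the swapped caption better than its own
    # caption (and symmetrically for the mirrored CSI).
    confuse = F.relu((z * t_mirror).sum(-1) - (z * t).sum(-1) + margin).mean() \
            + F.relu((z_mirror * t).sum(-1) - (z_mirror * t_mirror).sum(-1) + margin).mean()
    return align + confuse
```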
📝 Abstract
Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging because of the semantic gap between wireless signals and language, and because of direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher's visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left/right ambiguities during cross-modal alignment. A prefix-tuned language model then generates action descriptions from CSI embeddings. We also introduce the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark for semantic captioning from Wi-Fi signals. Experimental results show that WiFi2Cap consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, demonstrating effective privacy-friendly semantic sensing.
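As a rough illustration of the third stage, the sketch below shows one common way to condition a frozen language model on an external embedding via learned prefix tokens. It uses GPT-2 from Hugging Face Transformers purely as a stand-in; the projection size, prefix length, and the dummy `csi_embedding` input are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

class CSIPrefixCaptioner(nn.Module):
    """Sketch: map a CSI embedding to k prefix vectors prepended to a frozen LM."""

    def __init__(self, csi_dim: int = 512, prefix_len: int = 10, lm_name: str = "gpt2"):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(lm_name)
        for p in self.lm.parameters():           # keep the language model frozen
            p.requires_grad = False
        d_model = self.lm.config.n_embd
        self.prefix_len = prefix_len
        self.proj = nn.Linear(csi_dim, prefix_len * d_model)  # trainable mapping

    def forward(self, csi_embedding: torch.Tensor, caption_ids: torch.Tensor):
        b = csi_embedding.size(0)
        prefix = self.proj(csi_embedding).view(b, self.prefix_len, -1)
        token_embeds = self.lm.transformer.wte(caption_ids)
        inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
        # Ignore the loss on prefix positions; supervise only the caption tokens.
        ignore = torch.full((b, self.prefix_len), -100, dtype=torch.long,
                            device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels).loss

# Hypothetical usage with a random CSI embedding and a toy caption:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = CSIPrefixCaptioner()
caption_ids = tokenizer("a person raises the left arm", return_tensors="pt").input_ids
loss = model(torch.randn(1, 512), caption_ids)
loss.backward()   # gradients flow only into the prefix projection
```

In this style of setup only the small projection is trained, so the captioner inherits the language model's fluency while the CSI embedding steers what it describes; whether WiFi2Cap uses exactly this parameterization is not stated in the abstract.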