AI Summary
Existing WiFi-based gesture recognition methods generalize poorly across scenarios and represent gesture semantics weakly, because Channel State Information (CSI) is highly sensitive to domain variations and current models lack high-level semantic abstraction. To address these limitations, we propose a large-model-aware semantic distillation and alignment framework that integrates semantic priors from pre-trained large foundation models. A dual-path pipeline encodes CSI-ratio phase sequences and Doppler spectrograms, and a multi-scale semantic encoder aligns the resulting temporal embeddings with gesture semantics through cross-modal attention. We further design a semantic-aware soft supervision scheme and a joint distillation strategy for intermediate features and soft labels. Evaluated on the Widar3.0 benchmark, our method improves recognition accuracy and cross-domain generalization while reducing model size and inference latency. This work provides an efficient, privacy-preserving, contactless human-computer interaction solution for AIoT applications.
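The CSI-ratio phase input mentioned above can be illustrated with a small sketch. It assumes the usual CSI-ratio idea: two antennas on the same receiver share the same per-packet phase offset, so that offset cancels when one antenna's CSI is divided by the other's. The function name `csi_ratio_phase` and the toy values are hypothetical and are not taken from the paper, whose actual preprocessing may differ.

```python
import cmath

def csi_ratio_phase(h_ref, h_other, eps=1e-12):
    """Return the phase (radians) of h_other / h_ref for each packet."""
    phases = []
    for ref, other in zip(h_ref, h_other):
        if abs(ref) < eps:
            raise ValueError("reference antenna CSI is too close to zero")
        phases.append(cmath.phase(other / ref))
    return phases

# Toy data (assumed): two antennas, four packets, one subcarrier.
clean_a = [cmath.rect(1.0, 0.10 * t) for t in range(4)]
clean_b = [cmath.rect(0.8, 0.10 * t + 0.5 + 0.05 * t) for t in range(4)]

# A random per-packet phase offset that is common to both antennas.
offsets = [0.7, -1.3, 2.1, 0.2]
noisy_a = [h * cmath.rect(1.0, o) for h, o in zip(clean_a, offsets)]
noisy_b = [h * cmath.rect(1.0, o) for h, o in zip(clean_b, offsets)]

# The quotient removes the shared offset, so both results agree.
phase_clean = csi_ratio_phase(clean_a, clean_b)
phase_noisy = csi_ratio_phase(noisy_a, noisy_b)
```

In this toy setup the ratio phase tracks only the relative phase between the two antennas, which is the part that carries the motion-induced change.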
Abstract
WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for enabling non-contact and privacy-preserving human-computer interaction in AIoT environments. However, existing methods often suffer from limited generalization and semantic expressiveness due to the domain-sensitive nature of Channel State Information (CSI) and the lack of high-level gesture abstraction. To address these challenges, we propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which leverages the semantic prior of pre-trained large foundation models to enhance gesture representation learning in both in-domain and cross-domain scenarios. Specifically, we first design a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns via CSI-Ratio phase sequences and Doppler spectrograms. These representations are then fed into a Multiscale Semantic Encoder, which learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention mechanisms. To further enhance category discrimination, we introduce a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations and reduces label ambiguity, especially for semantically similar gestures. Finally, we develop a Robust Dual-Distillation strategy to compress the aligned model into a lightweight student network, jointly distilling intermediate features and semantic-informed soft labels from the teacher model. Extensive experiments on the Widar3.0 benchmark show that GLSDA consistently outperforms state-of-the-art methods in both in-domain and cross-domain gesture recognition tasks, while significantly reducing model size and inference latency. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.
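As a rough illustration of the last two components, the sketch below builds a semantic soft label and a two-term distillation loss. Everything in it is an assumption made for illustration: the abstract does not give formulas, so the soft label here simply mixes a one-hot target with a softmax over cosine similarities between hypothetical class embeddings, and the loss sums a temperature-scaled KL term on logits with a mean-squared-error term on features. The function names, weights, and toy embeddings are not from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_soft_label(label, class_embeddings, smoothing=0.2, temperature=0.5):
    """Mix a one-hot target with a similarity-based class distribution (assumed form)."""
    sims = [cosine(class_embeddings[label], e) for e in class_embeddings]
    semantic = softmax(sims, temperature)
    return [
        (1.0 - smoothing) * (1.0 if k == label else 0.0) + smoothing * p
        for k, p in enumerate(semantic)
    ]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def dual_distillation_loss(student_logits, teacher_logits,
                           student_feat, teacher_feat,
                           temperature=2.0, feat_weight=1.0):
    """Soft-label KL term plus feature MSE term (assumed combination)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(p_teacher, p_student)
    feat_term = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return soft_term + feat_weight * feat_term

# Hypothetical class embeddings: "push" and "pull" are close, "clap" is far.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
target = semantic_soft_label(0, embeddings)

same = dual_distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1], [0.3, 0.7], [0.3, 0.7])
diff = dual_distillation_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1], [0.0, 0.0], [0.3, 0.7])
```

With these toy embeddings the soft label keeps most of its mass on the true class and gives the semantically closer class more mass than the distant one. The loss is zero when the student matches the teacher and positive otherwise.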