🤖 AI Summary
This work addresses zero-shot understanding of unsegmented human activities from Wi-Fi channel state information (CSI), a task hindered by existing methods' reliance on precise signal segmentation and predefined action labels. The authors propose WirelessSenseLLM, a language-driven zero-shot sensing framework that uses a CSI-to-Language Adapter and a cross-modal projection mechanism to map raw temporal CSI features end-to-end into a semantic space aligned with large language models (LLMs), enabling direct generation of fine-grained natural language descriptions. The approach is presented as the first to achieve zero-shot human activity understanding without segmented training data. It supports disentangling overlapping actions and language-based reasoning while bridging the modality gap and handling ambiguous action boundaries. Experiments report 92% accuracy and a 91% F1 score in zero-shot action recognition, improvements of 30% in factual correctness and 15% in reasoning capability of the generated language, and an average 12.33% gain over existing methods in multi-person activity interpretation.
📝 Abstract
There is growing interest in enabling wireless sensing systems to interpret human motion from unsegmented wireless signals; however, existing applications based on Channel State Information (CSI) rely heavily on accurate signal segmentation and predefined action labels, limiting their applicability in zero-shot scenarios. We present WirelessSenseLLM, a language-driven framework that leverages large language models (LLMs) to enable zero-shot human motion understanding from unsegmented Wi-Fi CSI. We address two core technical challenges: the modality mismatch between CSI features and language embeddings, and overlapping actions in unsegmented CSI streams. To bridge the modality gap between time-series CSI and discrete language representations, we introduce a CSI-to-Language Adapter and a cross-modal projection mechanism that maps CSI features into a language-aligned semantic space. This design enables the generation of fine-grained natural language descriptions of sequential and overlapping human motions, supporting downstream reasoning without segmented training data. Extensive experiments demonstrate strong performance in zero-shot action understanding (92% accuracy and 91% F1-score), language-based reasoning quality (improvements of 30% in factual correctness and 15% in reasoning), and multi-person motion explanation, with an average 12.33% improvement over prior methods. These results highlight WirelessSenseLLM's effectiveness for robust and interpretable human motion understanding from CSI signals.
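To make the idea of cross-modal projection concrete, the following is a minimal sketch, not the authors' implementation. It assumes the CSI stream is a real-valued amplitude matrix of shape (time steps, subcarriers), cuts it into fixed-length windows without any action-level segmentation, and applies a single linear map into a hypothetical LLM embedding dimension. The function name `csi_to_soft_tokens`, the patch length, the dimensions, and the random weights are all illustrative assumptions; the abstract does not specify the adapter's architecture.

```python
import numpy as np


def csi_to_soft_tokens(csi, patch_len, w, b):
    """Project an unsegmented CSI stream into a language-embedding-sized space.

    csi: (T, S) array, T time steps by S subcarriers (assumed amplitude values).
    patch_len: number of consecutive time steps grouped into one patch.
    w, b: weights (patch_len * S, D) and bias (D,) of a hypothetical linear projection.
    Returns an array of shape (T // patch_len, D), one embedding per patch.
    """
    t, s = csi.shape
    n = t // patch_len  # trailing samples that do not fill a patch are dropped
    patches = csi[: n * patch_len].reshape(n, patch_len * s)
    # Per-patch standardization so the projection sees comparable scales.
    mu = patches.mean(axis=1, keepdims=True)
    sigma = patches.std(axis=1, keepdims=True) + 1e-8
    patches = (patches - mu) / sigma
    return patches @ w + b


# Illustrative sizes only: 100 time steps, 30 subcarriers, 64-dim embeddings.
rng = np.random.default_rng(0)
T, S, P, D = 100, 30, 10, 64
csi = rng.standard_normal((T, S))          # stand-in for real CSI amplitudes
w = rng.standard_normal((P * S, D)) * 0.02  # untrained placeholder weights
b = np.zeros(D)

tokens = csi_to_soft_tokens(csi, P, w, b)
print(tokens.shape)  # (10, 64)
```

In a trained system of this kind, the projected vectors would typically be fed to the language model alongside text embeddings, with the projection learned so that CSI patches land near the embeddings of the words that describe them. Here the weights are random, so the output only illustrates the shapes involved.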