🤖 AI Summary
Visual gesture recognition in VR incurs high computational overhead, is sensitive to lighting conditions, and raises privacy risks, while existing acoustic approaches (e.g., CIR-based methods) rely heavily on labeled data and generalize poorly to few-shot scenarios. Method: This paper proposes the first large language model (LLM)-based acoustic gesture recognition framework. It leverages micro-Doppler effects to capture acoustic-field perturbations induced by hand motions and employs differential channel impulse response (CIR) acquisition for low-power, privacy-preserving interaction. Contribution/Results: By integrating LLMs into acoustic gesture recognition, the framework enables few-shot and zero-shot classification without domain-specific fine-tuning. Evaluated on an empirical dataset comprising 15 gestures and 10 participants, it achieves accuracy comparable to supervised baselines—without task-specific adaptation—while significantly improving cross-user and cross-scenario generalization.
📝 Abstract
Natural and efficient interaction remains a critical challenge for virtual reality and augmented reality (VR/AR) systems. Vision-based gesture recognition suffers from high computational cost, sensitivity to lighting conditions, and privacy leakage concerns. Acoustic sensing provides an attractive alternative: by emitting inaudible high-frequency signals and capturing their reflections, the channel impulse response (CIR) encodes how gestures perturb the acoustic field in a low-cost and user-transparent manner. However, existing CIR-based gesture recognition methods often rely on extensive training on large labeled datasets, making them unsuitable for few-shot VR scenarios. In this work, we propose the first framework that leverages large language models (LLMs) for CIR-based gesture recognition in VR/AR systems. Despite LLMs' strengths, achieving few-shot and zero-shot learning on CIR gestures is non-trivial because their features are inconspicuous. To tackle this challenge, we collect differential CIR rather than raw CIR data. Moreover, we construct a real-world dataset collected from 10 participants performing 15 gestures across three categories (digits, letters, and shapes), with 10 repetitions each. We then conduct extensive experiments on this dataset using an LLM-adopted classifier. Results show that our LLM-based framework achieves accuracy comparable to classical machine learning baselines, while requiring no domain-specific retraining.
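The abstract's key preprocessing step — using differential CIR rather than raw CIR so that static reflections cancel and only gesture-induced perturbations remain — can be sketched as a frame-to-frame difference. The snippet below is a minimal illustration under assumed data shapes (the function name `differential_cir` and the (frames × taps) layout are hypothetical, not from the paper):

```python
import numpy as np

def differential_cir(cir_frames):
    """Differential CIR: magnitude of the difference between consecutive
    CIR frames. Static multipath components (walls, furniture) cancel,
    leaving perturbations caused by hand motion.

    cir_frames: complex array of shape (T, L) — T frames, L channel taps.
    Returns an array of shape (T-1, L).
    """
    cir = np.asarray(cir_frames)
    return np.abs(cir[1:] - cir[:-1])

# Toy example: one static acoustic scene plus a single drifting tap
# standing in for a moving hand.
T, L = 5, 8
rng = np.random.default_rng(0)
static = rng.standard_normal(L) + 1j * rng.standard_normal(L)
frames = np.stack([static.copy() for _ in range(T)])
frames[:, 3] += np.linspace(0.0, 1.0, T)  # tap 3 changes over time

d = differential_cir(frames)
print(d.shape)  # (T-1, L) = (4, 8); only tap 3 is nonzero
```

In this toy example the static background vanishes entirely after differencing, which is why the differential representation gives the LLM-based classifier more conspicuous features to work with than the raw CIR.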