🤖 AI Summary
This work presents the first Wi-Fi backscatter-based system for open-vocabulary silent speech recognition, addressing the limitations of existing approaches, which are typically confined to closed vocabularies and face trade-offs among privacy, user experience, and power consumption. By leveraging frequency-shifted backscatter tags to capture lip-motion signals, the proposed method integrates self-supervised representation learning with a lexicon-guided Transformer decoder to suppress interference and enhance semantic coherence. Evaluated on a test set comprising 340 sentences and 3,398 words, the system achieves a word accuracy of 85.61% and a word error rate (WER) of 36.87%, approaching the performance of vision-based lip-reading systems. These results mark a significant step toward low-power, privacy-preserving silent interaction in open-domain scenarios.
📝 Abstract
Silent speech interfaces (SSIs) enable silent interaction in noise-sensitive or privacy-sensitive settings. However, existing SSIs face practical deployment trade-offs among privacy, user experience, and energy consumption, and most remain limited to closed-set recognition over small, pre-defined vocabularies of words or sentences, which restricts real-world expressiveness. In this paper, we present Lip-Siri, to the best of our knowledge, the first Wi-Fi backscatter-based SSI that supports open-vocabulary sentence recognition via lexicon-guided subword decoding. Lip-Siri designs a frequency-shifted backscatter tag to isolate tag-modulated reflections and suppress interference from non-target motions, enabling reliable extraction of lip-motion traces from ubiquitous Wi-Fi signals. We then segment continuous traces into lip-motion units, cluster them, learn robust unit representations via cluster-based self-supervision, and finally propose a lexicon-guided Transformer encoder-decoder with beam search to decode variable-length sentence sequences. We implement an end-to-end prototype and evaluate it with 15 participants on 340 sentences and 3,398 words across multiple scenarios. Lip-Siri achieves 85.61% accuracy on word prediction and a WER of 36.87% on continuous sentence recognition, approaching the performance of representative vision-based lip-reading systems.
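The lexicon-guided beam search described above can be sketched in simplified form. This is a hypothetical illustration, not the paper's implementation: the real system scores subword tokens with a Transformer encoder-decoder conditioned on lip-motion units, whereas here a toy scoring function and a trie over subword sequences stand in for it. The idea shown is that beam expansions are restricted to tokens that extend a valid word in the lexicon:

```python
from typing import Callable, Dict, List, Tuple

def build_trie(lexicon: Dict[str, List[str]]) -> dict:
    """Trie over subword sequences; the '$' key marks a completed word."""
    root: dict = {}
    for word, pieces in lexicon.items():
        node = root
        for p in pieces:
            node = node.setdefault(p, {})
        node["$"] = word
    return root

def lexicon_beam_search(
    score_fn: Callable[[List[str], str], float],
    trie: dict,
    steps: int,
    beam: int = 3,
) -> List[str]:
    """Beam search over subword tokens, constrained by a lexicon trie.

    Each hypothesis is (tokens, log-prob, current trie node); a token may
    only be emitted if it extends a word in the lexicon, which is the
    constraint the paper's lexicon guidance enforces during decoding.
    """
    beams: List[Tuple[List[str], float, dict]] = [([], 0.0, trie)]
    for _ in range(steps):
        cands = []
        for toks, lp, node in beams:
            # If a word just completed, we may also start a new word.
            starts = [node] + ([trie] if "$" in node else [])
            for start in starts:
                for piece, child in start.items():
                    if piece == "$":
                        continue
                    cands.append((toks + [piece], lp + score_fn(toks, piece), child))
        if not cands:
            break
        cands.sort(key=lambda c: c[1], reverse=True)
        beams = cands[:beam]
    # Prefer hypotheses that end exactly on a word boundary.
    finished = [(toks, lp) for toks, lp, node in beams if "$" in node]
    return max(finished, key=lambda c: c[1])[0] if finished else beams[0][0]
```

As a usage example with a toy lexicon and fixed per-token scores (all names here are illustrative), decoding two steps with `{"hello": ["he","llo"], "help": ["he","lp"], "hi": ["hi"]}` and scores favoring `"he"` then `"llo"` returns `["he", "llo"]`, i.e. the word "hello". In the actual system, `score_fn` would come from the trained decoder's next-token distribution.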