🤖 AI Summary
To jointly achieve silent speech recognition and speaker authentication in silent speech interfaces (SSIs), this paper proposes HEar-ID, an end-to-end joint modeling framework built on off-the-shelf active noise-cancelling earbuds. HEar-ID simultaneously captures low-frequency “whisper” audio from the ear canal and high-frequency ultrasonic echo signals, feeding both into a lightweight shared encoder trained with multi-task learning, augmented by contrastive learning and cross-modal feature alignment. To our knowledge, it is the first approach to concurrently achieve silent spelling recognition over a 50-word vocabulary and biometric speaker authentication on a single device with a single model, requiring no additional hardware or explicit user cooperation. Experiments show that HEar-ID maintains high spelling accuracy while substantially improving impostor rejection. This work establishes a new paradigm for seamless, privacy-preserving authentication in sensitive applications.
📝 Abstract
Silent speech interfaces (SSIs) enable hands-free input without audible vocalization, but most SSI systems do not verify speaker identity. We present HEar-ID, which uses consumer active noise-cancelling earbuds to capture low-frequency "whisper" audio and high-frequency ultrasonic reflections. Features from both streams pass through a shared encoder, producing embeddings that feed a contrastive branch for user authentication and an SSI head for silent spelling recognition. This design supports decoding a 50-word vocabulary while reliably rejecting impostors, all on commodity earbuds with a single model. Experiments demonstrate that HEar-ID achieves strong spelling accuracy and robust authentication.
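The shared-encoder, dual-head design described above can be sketched minimally as follows. This is an illustrative NumPy skeleton, not the paper's implementation: all layer sizes, the random weights, the enrolled template, and the acceptance threshold are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not publish its layer sizes.
FEAT_DIM = 128   # fused per-frame features from the whisper-audio + ultrasound streams
EMB_DIM = 64     # shared embedding size
VOCAB = 50       # 50-word spelling vocabulary

# Shared encoder: a single linear layer + ReLU stands in for the
# lightweight encoder that both tasks reuse.
W_enc = rng.normal(0.0, 0.1, (FEAT_DIM, EMB_DIM))

def encode(x):
    return np.maximum(x @ W_enc, 0.0)

# SSI head: linear classifier over the 50-word vocabulary.
W_ssi = rng.normal(0.0, 0.1, (EMB_DIM, VOCAB))

def spell_logits(emb):
    return emb @ W_ssi

# Authentication branch: cosine similarity between the utterance
# embedding and an enrolled user template, thresholded to accept/reject.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def authenticate(emb, template, threshold=0.5):  # threshold is illustrative
    return cosine(emb, template) >= threshold

# Toy forward pass over one fused feature frame.
x = rng.normal(size=FEAT_DIM)
emb = encode(x)
word_id = int(np.argmax(spell_logits(emb)))        # predicted word index
template = encode(rng.normal(size=FEAT_DIM))       # stand-in enrolled embedding
accepted = authenticate(emb, template)             # True/False decision
```

Because both heads read the same embedding, a contrastive loss on the authentication branch and a classification loss on the SSI head can be trained jointly against the shared encoder, which is the multi-task structure the abstract describes.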