🤖 AI Summary
This work targets a key weakness of existing voice authenticity detection methods, which remain vulnerable to high-fidelity synthetic and edited attacks and often depend on visual cues or wearable sensors. The authors propose a contactless voice authentication approach built on millimeter-wave radar, introducing laryngeal physiological vibrations as a hard-to-forge anchor tied directly to spoken content. This shifts the paradigm beyond conventional liveness detection, which merely verifies whether a person is speaking, toward fine-grained verification of what is being said. The method employs phase-aware vibration extraction, modality-specific encoders, and a two-stage audio–radar alignment strategy combining signal-level onset detection with semantic-level consistency, all integrated within a cross-modal contrastive learning framework. Evaluated across diverse spoofing types, the system achieves an equal error rate (EER) of 0.017 while maintaining low latency and modest computational overhead.
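For context on the reported metric, below is a minimal sketch of how an equal error rate is typically computed from per-sample match scores. The score distributions and function names are hypothetical placeholders for illustration, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Compute the EER: the operating point where the false-acceptance rate
    (spoofed samples accepted) equals the false-rejection rate
    (genuine samples rejected)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)   # spoofed samples passing the check
        frr = np.mean(genuine_scores < t)  # genuine samples failing the check
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical similarity scores (higher = more consistent audio/radar pair).
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 1000)
spoof = rng.normal(0.3, 0.1, 1000)
print(f"EER ≈ {equal_error_rate(genuine, spoof):.3f}")
```

An EER of 0.017 means that, at the threshold where the two error rates meet, roughly 1.7% of genuine utterances are rejected and 1.7% of forged ones are accepted.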
📝 Abstract
Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal framework that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual-stage strategy that combines signal-level onset detection and semantic-level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word-level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine-grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.
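To make the phase-aware extraction component more concrete, the sketch below follows a standard FMCW mmWave vibrometry recipe: pick a range bin, unwrap the slow-time phase, and convert phase to displacement. The carrier frequency, range-bin selection, and detrending choices here are assumptions for illustration only, not VoxAnchor's actual pipeline.

```python
import numpy as np

WAVELENGTH_M = 3e8 / 77e9  # wavelength for a hypothetical 77 GHz radar

def extract_throat_vibration(chirp_cube, throat_bin=None):
    """chirp_cube: complex array of shape (num_chirps, num_range_bins),
    i.e. the range FFT of each chirp over a slow-time window.
    Returns an estimated throat displacement waveform in metres."""
    # 1) Pick the range bin with the strongest reflection (assumed to be the throat).
    if throat_bin is None:
        throat_bin = int(np.argmax(np.mean(np.abs(chirp_cube), axis=0)))
    slow_time = chirp_cube[:, throat_bin]

    # 2) Extract and unwrap the phase across chirps (slow time).
    phase = np.unwrap(np.angle(slow_time))

    # 3) Remove slow drift (gross body motion), keeping only fine vibrations.
    x = np.arange(len(phase))
    phase = phase - np.polyval(np.polyfit(x, phase, 1), x)

    # 4) Convert phase to displacement: one wavelength of round trip is 4π of phase.
    return phase * WAVELENGTH_M / (4 * np.pi)

# Hypothetical data: 2000 chirps x 64 range bins of complex samples.
rng = np.random.default_rng(1)
cube = rng.normal(size=(2000, 64)) + 1j * rng.normal(size=(2000, 64))
vibration = extract_throat_vibration(cube)
print(vibration.shape)
```

In a pipeline like the one the abstract describes, the resulting displacement stream would presumably be segmented, aligned with the audio via the onset/semantic dual-stage strategy, and passed to the radar encoder for word-level consistency checks against the acoustic signal.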