VoxAnchor: Grounding Speech Authenticity in Throat Vibration via mmWave Radar

📅 2026-03-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of existing voice authenticity detection methods to high-fidelity synthetic and edited attacks, which often rely on visual cues or wearable sensors. The authors propose a contactless voice authentication approach leveraging millimeter-wave radar, introducing laryngeal physiological vibrations as an unforgeable anchor tied directly to spoken content. This paradigm shifts beyond conventional liveness detection—which merely verifies whether a person is speaking—toward fine-grained verification of what is being said. The method employs phase-aware vibration extraction, modality-specific encoders, and a two-stage audio–radar alignment strategy combining signal-level onset detection with semantic-level consistency, all integrated within a cross-modal contrastive learning framework. Evaluated across diverse spoofing types, the system achieves an equal error rate (EER) of 0.017 while maintaining low latency and computational overhead.
📝 Abstract
Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal frame-work that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual-stage strategy that combines signal-level onset detection and semantic-level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word-level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine-grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.
Problem

Research questions and friction points this paper is trying to address.

speech authenticity
audio forgery detection
mmWave radar
throat vibration
deepfake
Innovation

Methods, ideas, or system contributions that make the work stand out.

mmWave radar
speech authentication
cross-modal learning
throat vibration
deepfake detection
🔎 Similar Papers
No similar papers found.