🤖 AI Summary
This work targets a key weakness of existing voice authenticity detection methods, which remain vulnerable to high-fidelity synthetic and edited attacks and often depend on visual cues or wearable sensors. The authors propose a contactless voice authentication approach built on millimeter-wave radar, introducing laryngeal physiological vibrations as a hard-to-forge anchor tied directly to spoken content. This shifts the paradigm beyond conventional liveness detection, which merely verifies whether a person is speaking, toward fine-grained verification of what is being said. The method employs phase-aware vibration extraction, modality-specific encoders, and a two-stage audio–radar alignment strategy combining signal-level onset detection with semantic-level consistency, all integrated within a cross-modal contrastive learning framework. Evaluated across diverse spoofing types, the system achieves an equal error rate (EER) of 0.017 while maintaining low latency and modest computational overhead.
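For context on the reported metric, below is a minimal sketch of how an equal error rate is typically computed from per-sample match scores. The score distributions and function names are hypothetical placeholders for illustration, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Compute the EER: the operating point where the false-acceptance rate
    (spoofed samples accepted) equals the false-rejection rate
    (genuine samples rejected)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)   # spoofed samples passing the check
        frr = np.mean(genuine_scores < t)  # genuine samples failing the check
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical similarity scores (higher = more consistent audio/radar pair).
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 1000)
spoof = rng.normal(0.3, 0.1, 1000)
print(f"EER ≈ {equal_error_rate(genuine, spoof):.3f}")
```

An EER of 0.017 means that, at the threshold where the two error rates meet, roughly 1.7% of genuine utterances are rejected and 1.7% of forged ones are accepted.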
📝 Abstract
Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal framework that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual-stage strategy that combines signal-level onset detection and semantic-level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word-level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine-grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.
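To make the phase-aware extraction component more concrete, the sketch below follows a standard FMCW mmWave vibrometry recipe: pick a range bin, unwrap the slow-time phase, and convert phase to displacement. The carrier frequency, range-bin selection, and detrending choices here are assumptions for illustration only, not VoxAnchor's actual pipeline.

```python
import numpy as np

WAVELENGTH_M = 3e8 / 77e9  # wavelength for a hypothetical 77 GHz radar

def extract_throat_vibration(chirp_cube, throat_bin=None):
    """chirp_cube: complex array of shape (num_chirps, num_range_bins),
    i.e. the range FFT of each chirp over a slow-time window.
    Returns an estimated throat displacement waveform in metres."""
    # 1) Pick the range bin with the strongest reflection (assumed to be the throat).
    if throat_bin is None:
        throat_bin = int(np.argmax(np.mean(np.abs(chirp_cube), axis=0)))
    slow_time = chirp_cube[:, throat_bin]

    # 2) Extract and unwrap the phase across chirps (slow time).
    phase = np.unwrap(np.angle(slow_time))

    # 3) Remove slow drift (gross body motion), keeping only fine vibrations.
    x = np.arange(len(phase))
    phase = phase - np.polyval(np.polyfit(x, phase, 1), x)

    # 4) Convert phase to displacement: one wavelength of round trip is 4π of phase.
    return phase * WAVELENGTH_M / (4 * np.pi)

# Hypothetical data: 2000 chirps x 64 range bins of complex samples.
rng = np.random.default_rng(1)
cube = rng.normal(size=(2000, 64)) + 1j * rng.normal(size=(2000, 64))
vibration = extract_throat_vibration(cube)
print(vibration.shape)
```

In a pipeline like the one the abstract describes, the resulting displacement stream would presumably be segmented, aligned with the audio via the onset/semantic dual-stage strategy, and passed to the radar encoder for word-level consistency checks against the acoustic signal.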