🤖 AI Summary
This work proposes the first real-time caller authentication system that operates entirely at the audio layer, designed for environments lacking network connectivity or trusted infrastructure. By embedding a neural watermark into speech at the level of 40-millisecond frames, the system turns the voice channel into a low-bitrate serial communication link. It combines a lightweight symmetric-key authentication protocol with a custom data link layer (frame synchronization, error control, and selective retransmission) to enable end-to-end authentication without relying on speech recognition or network support. The system achieves an authentication success rate above 99.2% on clean speech and above 95% under common distortions, while preserving high voice quality (PESQ > 4.2, STOI > 0.94). Authentication takes 63 seconds on average, which the authors present as evidence of practical deployability.
📝 Abstract
We present CallShield, the first caller identity authentication system that operates entirely at the audio layer, without relying on speech transcription, internet connectivity, or trusted infrastructure. CallShield introduces a real-time neural watermarking technique that enables per-bit embedding and recovery within 40-millisecond frames of live 8 kHz speech. This capability allows CallShield to transform the real-time audio channel into a noisy serial communication medium. To ensure reliable data transmission, CallShield implements a low-bitrate data link protocol that provides basic frame synchronization along with error detection, correction, and recovery. For caller authentication, CallShield adopts a secure and lightweight symmetric-key protocol that relies on pairwise shared secrets among trusted contacts. The system completes the full authentication process in an average of 63 seconds, including up to three retransmission attempts, making it suitable for real-time deployment. Extensive experiments under realistic telephony conditions demonstrate that CallShield achieves overall authentication success rates exceeding 99.2% on clean audio and 95% under common distortions, aided by selective retransmission of failed messages. Additionally, CallShield maintains high audio quality, achieving PESQ scores above 4.2 and STOI scores above 0.94 on clean speech, and exhibits robustness across a wide range of channel distortions, validating its practical viability for secure, real-time caller authentication.
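To make the channel budget concrete, the following is a minimal sketch of the arithmetic implied by the figures above, together with a generic symmetric-key challenge-response of the kind the abstract describes. It is not CallShield's actual protocol. The 8 kHz sample rate and 40 ms frame length come from the abstract; everything else is an assumption for illustration: one watermark bit per frame (`BITS_PER_FRAME`), the use of truncated HMAC-SHA256 as the authentication tag, the 64-bit challenge and tag sizes, and all function names.

```python
import hashlib
import hmac

SAMPLE_RATE_HZ = 8000   # from the abstract: live 8 kHz speech
FRAME_MS = 40           # from the abstract: 40-millisecond frames
BITS_PER_FRAME = 1      # ASSUMPTION: one watermark bit per frame


def samples_per_frame() -> int:
    """Number of audio samples covered by one watermark frame."""
    return SAMPLE_RATE_HZ * FRAME_MS // 1000


def raw_bitrate_bps() -> float:
    """Raw channel bitrate before any framing or error-control overhead."""
    return BITS_PER_FRAME * 1000 / FRAME_MS


def airtime_seconds(n_bits: int) -> float:
    """Time needed to send n_bits over the raw channel, ignoring overhead."""
    return n_bits / raw_bitrate_bps()


def make_tag(shared_key: bytes, challenge: bytes, tag_bytes: int = 8) -> bytes:
    """ASSUMPTION: truncated HMAC-SHA256 stands in for the paper's protocol."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()[:tag_bytes]


def verify_tag(shared_key: bytes, challenge: bytes, tag: bytes) -> bool:
    """Recompute the tag from the pairwise shared secret and compare safely."""
    expected = make_tag(shared_key, challenge, len(tag))
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    print(samples_per_frame())       # samples in one 40 ms frame at 8 kHz
    print(raw_bitrate_bps())         # raw bits per second under the assumption
    # Hypothetical exchange: 64-bit challenge one way, 64-bit tag back.
    print(airtime_seconds(64 + 64))  # seconds of speech needed, no overhead
```

Under the one-bit-per-frame assumption the raw link carries 25 bits per second, so even a short exchange of 128 payload bits occupies about 5 seconds of speech before synchronization, error-correction redundancy, and retransmissions are counted. This suggests why overhead and retransmission are likely to dominate the reported 63-second average, and why short messages and truncated tags are attractive on such a channel.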