AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Date: 2024-02-28
Venue: arXiv.org
Citations: 0
Influential citations: 0
AI Summary
Real-time deepfake (RTDF) voice synthesis is increasingly exploited in targeted social engineering attacks that circumvent telephone authentication systems, resulting in substantial financial losses. To address this, we propose PITCH, a challenge-response framework built on a taxonomy of audio challenges that draws on the human auditory system, linguistics, and environmental factors, and that supports human-AI collaborative risk assessment during a call. Evaluated on a dataset of 18,600 original and 1.6 million deepfake samples from 100 users, automated detection achieves 88.7% AUROC on the full unbalanced set and 87.7% accuracy on a balanced subset. On that same subset, human evaluators alone reach 72.6% accuracy, and machine-assisted humans reach 84.5%. The system uses an early-warning label that tags suspicious incoming calls as "Deepfake-likely" while leaving the final decision to the call receiver, and the code is openly available.

Abstract
The rise of AI voice-cloning technology, particularly audio Real-time Deepfakes (RTDFs), has intensified social engineering attacks by enabling real-time voice impersonation that bypasses conventional enrollment-based authentication. To address this, we propose PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors, yielding 20 prospective challenges. These were tested against leading voice-cloning systems using a novel dataset comprising 18,600 original and 1.6 million deepfake samples from 100 users. PITCH's prospective challenges enhanced machine detection capabilities to an AUROC of 88.7% on the full unbalanced dataset, enabling us to shortlist 10 functional challenges that balance security and usability. For human evaluation and subsequent analyses, we filtered a challenging, balanced subset. On this subset, human evaluators independently scored 72.6% accuracy, while machines achieved 87.7%. Acknowledging that call environments require higher human control, we used machine predictions to aid call receivers in making their decisions. Our solution uses an early warning system to tag suspicious incoming calls as "Deepfake-likely." Contrary to prior findings, we discovered that integrating human intuition with machine precision offers complementary advantages. Our solution gave users maximum control and boosted their detection accuracy from 72.6% to 84.5%. Evidenced by this jump in accuracy, PITCH demonstrated the potential for AI-assisted pre-screening in call verification processes, offering an adaptable and usable approach to combat real-time voice-cloning attacks. Code to reproduce the results and access the data is available at https://github.com/mittalgovind/PITCH-Deepfakes.
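The abstract reports AUROC on the full unbalanced dataset and accuracy only on a balanced subset. One plausible reason is that, with roughly 1.6 million deepfake samples against 18,600 originals, plain accuracy can look good for a detector that has learned nothing. The sketch below illustrates this with made-up toy scores (not data from the paper); the functions `auroc` and `accuracy` are illustrative helpers, not part of the PITCH code.

```python
def auroc(labels, scores):
    """Rank-based AUROC: chance that a random positive outscores a random
    negative, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy, fabricated example: 1 = deepfake, 0 = genuine, skewed toward deepfakes.
labels = [1] * 8 + [0] * 2
constant_scores = [0.9] * 10  # a "detector" that flags every call

print(accuracy(labels, constant_scores))  # 0.8, despite no discrimination
print(auroc(labels, constant_scores))     # 0.5, i.e. chance level
```

On a balanced subset the same constant detector would score 50% accuracy, which is why accuracy becomes a fair comparison between humans and machines there.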
Problem

Research questions and friction points this paper is trying to address.

Detecting real-time deepfake audio in phone calls
Combating AI voice-cloning social engineering attacks
Enhancing human-AI collaboration for call authentication
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-assisted challenge-response deepfake detection
Human-AI collaborative tagging system
Comprehensive auditory and linguistic challenge taxonomy
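
The human-AI tagging idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ChallengeResult`, `tag_call`, and `screen_call`, the score averaging, and the 0.5 threshold are all hypothetical. The only details taken from the abstract are the "Deepfake-likely" tag and the principle that the call receiver keeps the final decision.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class ChallengeResult:
    challenge: str     # placeholder name of the audio challenge posed
    fake_score: float  # hypothetical detector output in [0, 1]; higher = more suspicious


def tag_call(results: List[ChallengeResult], threshold: float = 0.5) -> str:
    """Average the per-challenge scores and emit an early-warning tag.
    Averaging and the threshold value are assumptions for illustration."""
    if not results:
        raise ValueError("at least one challenge result is required")
    avg = mean(r.fake_score for r in results)
    return "Deepfake-likely" if avg >= threshold else "No warning"


def screen_call(results: List[ChallengeResult],
                receiver_decides: Callable[[str], bool]) -> dict:
    """Show the tag to the receiver, who makes the final accept/reject call."""
    tag = tag_call(results)
    accepted = receiver_decides(tag)  # the machine advises, the human decides
    return {"tag": tag, "accepted": accepted}


# Usage with placeholder challenges and a receiver who rejects tagged calls.
results = [ChallengeResult("challenge-1", 0.9), ChallengeResult("challenge-2", 0.7)]
outcome = screen_call(results, receiver_decides=lambda tag: tag != "Deepfake-likely")
print(outcome)  # {'tag': 'Deepfake-likely', 'accepted': False}
```

Passing the receiver's choice in as a callback keeps the machine output advisory, which mirrors the abstract's emphasis on giving users maximum control.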
Authors

Govind Mittal
New York University, Tandon School of Engineering, Brooklyn, NY
Arthur Jakobsson
Carnegie Mellon University, Pittsburgh, PA
Kelly O. Marshall
Ph.D. Candidate, NYU
Deep Learning, 3D Machine Learning, Generative Modeling
Chinmay Hegde
New York University
AI
Nasir D. Memon
New York University, Tandon School of Engineering, Brooklyn, NY