AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

📅 2024-02-28

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Real-time deepfake (RTDF) voice synthesis is increasingly exploited in targeted social engineering attacks to circumvent telephone authentication, resulting in substantial financial losses. To address this, we propose PITCH, a challenge-response framework built on a taxonomy of audio challenges that draws on the human auditory system, linguistics, and environmental factors, and that supports human-AI collaborative risk assessment during live calls. On a dataset of 18,600 original and 1.6 million deepfake samples from 100 users, automated detection reaches 88.7% AUROC on the full imbalanced set and 87.7% accuracy on a balanced subset. On that same subset, unaided human evaluators score 72.6% accuracy, and AI-assisted humans reach 84.5%, a gain of about 12 points over humans alone, although still below the machine-only figure. The system uses an early-warning label ("Deepfake-likely") for suspicious calls and an open-source interactive interface, offering a usable real-time defense against RTDFs that leaves the final decision with the call receiver.
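The choice to report AUROC on the full imbalanced set and accuracy only on a balanced subset can be illustrated with a little arithmetic. The sketch below is not the authors' evaluation code: the function names are hypothetical, and 1,600,000 is an assumed round figure for the "1.6 million" deepfake samples. It shows that a detector labeling every sample as a deepfake would already score close to 99% accuracy on the full set, while its AUROC would stay at chance level.

```python
def majority_baseline_accuracy(n_real: int, n_fake: int) -> float:
    """Accuracy of a trivial detector that assigns every sample to the larger class."""
    return max(n_real, n_fake) / (n_real + n_fake)


def auroc(scores_fake, scores_real) -> float:
    """AUROC computed as the probability that a randomly chosen deepfake
    receives a higher score than a randomly chosen real sample.
    Ties count as half a win. Quadratic time, fine for small illustrations."""
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))


# Sample counts from the summary; 1_600_000 is an assumed rounding of "1.6 million".
baseline = majority_baseline_accuracy(n_real=18_600, n_fake=1_600_000)

# A detector that outputs the same score for everything carries no information.
chance = auroc([0.5, 0.5, 0.5], [0.5, 0.5])

# A toy detector that ranks three of four fake/real pairs correctly.
partial = auroc([0.9, 0.3], [0.5, 0.1])
```

Because accuracy rewards guessing the majority class, it is only informative once the classes are balanced. AUROC depends on how the scores rank fakes against real samples, so it remains meaningful at the dataset's roughly 86:1 imbalance.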

📝 Abstract

The rise of AI voice-cloning technology, particularly audio Real-time Deepfakes (RTDFs), has intensified social engineering attacks by enabling real-time voice impersonation that bypasses conventional enrollment-based authentication. To address this, we propose PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors, yielding 20 prospective challenges. These were tested against leading voice-cloning systems using a novel dataset comprising 18,600 original and 1.6 million deepfake samples from 100 users. PITCH's prospective challenges enhanced machine detection capabilities to an 88.7% AUROC score on the full unbalanced dataset, enabling us to shortlist 10 functional challenges that balance security and usability. For human evaluation and subsequent analyses, we filtered a challenging, balanced subset. On this subset, human evaluators independently scored 72.6% accuracy, while machines achieved 87.7%. Acknowledging that call environments require higher human control, we let call receivers make the final decision with machine assistance. Our solution uses an early warning system to tag suspicious incoming calls as "Deepfake-likely." Contrary to prior findings, we discovered that integrating human intuition with machine precision offers complementary advantages. Our solution gave users maximum control and boosted their detection accuracy from 72.6% to 84.5%. Evidenced by this jump in accuracy, PITCH demonstrated the potential for AI-assisted pre-screening in call verification processes, offering an adaptable and usable approach to combat real-time voice-cloning attacks. Code for reproduction and data access is available at https://github.com/mittalgovind/PITCH-Deepfakes.
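The challenge-response and tagging flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PITCH implementation: `CallAssessment`, `assess_call`, the 0.5 threshold, the "No warning" label, and the callables passed in are all hypothetical. Only the "Deepfake-likely" tag comes from the abstract.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CallAssessment:
    challenge: str  # the challenge that was issued to the caller
    score: float    # detector's estimate that the response is a deepfake, in [0, 1]
    tag: str        # advisory label shown to the call receiver


def assess_call(
    challenges: Sequence[str],
    respond: Callable[[str], bytes],
    detector: Callable[[str, bytes], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
    rng: Optional[random.Random] = None,
) -> CallAssessment:
    """Issue one randomly chosen challenge, score the caller's audio response,
    and attach an advisory tag. The tag is a warning only; the call receiver
    still makes the final decision."""
    rng = rng or random.Random()
    challenge = rng.choice(list(challenges))
    audio = respond(challenge)         # caller's recorded response to the challenge
    score = detector(challenge, audio) # machine estimate of deepfake likelihood
    tag = "Deepfake-likely" if score >= threshold else "No warning"
    return CallAssessment(challenge=challenge, score=score, tag=tag)


# Usage with stand-in components (placeholder challenge names and a stub detector).
placeholder_challenges = ["challenge-A", "challenge-B", "challenge-C"]
suspicious = assess_call(
    placeholder_challenges,
    respond=lambda challenge: b"caller-audio",
    detector=lambda challenge, audio: 0.9,
    rng=random.Random(0),
)
```

Keeping the tag advisory mirrors the abstract's point that call receivers retain control, with the machine score serving as a pre-screening signal instead of an automatic block.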

Problem

Research questions and friction points this paper is trying to address.

Detecting real-time deepfake audio in phone calls

Combating AI voice-cloning social engineering attacks

Enhancing human-AI collaboration for call authentication

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-assisted challenge-response deepfake detection

Human-AI collaborative tagging system

Comprehensive auditory and linguistic challenge taxonomy

🔎 Similar Papers

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection