🤖 AI Summary
This study investigates whether people can distinguish AI-synthesized speech from human speech in realistic voice phishing (vishing) scenarios and examines the perceptual strategies they employ. In an online experiment, 22 participants classified 16 vishing audio clips as either human- or AI-generated and reported their confidence. Integrating signal detection theory with thematic analysis of 315 coded excerpts, the study reveals pervasive misclassification in authentic attack contexts: mean accuracy was only 37.5%, significantly below chance, and errors were often made with high confidence. The findings indicate that current AI systems effectively mimic the conventional paralinguistic cues of vocal authenticity, challenging prevailing models of auditory authenticity perception, and identify several auditory heuristics participants rely on when judging voices.
📝 Abstract
Large language models and commercial speech-synthesis systems now enable highly realistic AI-generated voice scams (vishing), raising urgent concerns about deception at scale. Yet it remains unclear whether individuals can reliably distinguish AI-generated speech from human-recorded voices in realistic scam contexts, and what perceptual strategies underlie their judgments. We conducted a controlled online study in which 22 participants evaluated 16 vishing-style audio clips (8 AI-generated, 8 human-recorded), classifying each as human or AI and reporting their confidence. Participants performed poorly: mean accuracy was 37.5%, below the 50% chance level of this binary classification task. At the stimulus level, misclassification was bidirectional: 75% of AI-generated clips were majority-labeled as human, while 62.5% of human-recorded clips were majority-labeled as AI. Signal detection theory analysis revealed near-zero discriminability (d′ ≈ 0), indicating an inability to reliably distinguish synthetic from human voices rather than a simple response bias. Qualitative analysis of 315 coded excerpts revealed reliance on paralinguistic and emotional heuristics, including pauses, filler words, vocal variability, cadence, and emotional expressiveness. However, these surface-level cues, traditionally associated with human authenticity, were frequently replicated by the AI-generated samples. Misclassifications were often made with moderate to high confidence, suggesting perceptual miscalibration rather than uncertainty. Together, our findings demonstrate that authenticity judgments based on vocal heuristics are unreliable in contemporary vishing scenarios. We discuss implications for security interventions, user education, and the mitigation of AI-mediated deception.
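For readers unfamiliar with the two statistics the abstract reports, the sketch below shows how accuracy can be tested against chance and how the equal-variance signal detection sensitivity index d′ is computed. The numbers are illustrative stand-ins consistent with the reported aggregates (22 participants × 16 clips, 37.5% accuracy), not the study's actual data or analysis code, and the pooled binomial test assumes independent trials, which the paper's own analysis may not.

```python
# Illustrative sketch of the abstract's two statistics; not the study's code.
from scipy.stats import binomtest, norm

# Below-chance accuracy: pooling 22 participants x 16 clips = 352 trials,
# 37.5% accuracy corresponds to 132 correct responses (assumed independent).
result = binomtest(k=132, n=352, p=0.5, alternative="less")
print(f"one-sided p vs. chance: {result.pvalue:.4f}")

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Equal-variance SDT sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# If listeners call AI clips "AI" about as often as they call human clips
# "AI", hit and false-alarm rates converge and d' approaches zero.
print(f"d' = {d_prime(0.40, 0.38):.3f}")  # ~0.05: near-zero discriminability
```

A d′ near zero separates sensitivity from bias: a listener who simply labels most clips "human" would show a response bias but could still have d′ > 0 if real differences were detectable, whereas d′ ≈ 0 means the two stimulus classes are perceptually indistinguishable.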