I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited understanding of how humans detect synthetic speech in real-world socio-technical contexts and which factors influence that ability. For the first time, trust-related cues (instructional framing, affective priming, and source labeling) are integrated into a synthetic speech detection task. In a controlled user experiment, 47 participants marked suspected synthetic segments in fully synthetic, partially synthetic, and genuine human utterances and rated perceptual quality on dimensions such as mechanicalness and expressiveness, while the trust cues were systematically manipulated. Utterance type was the primary determinant of detection accuracy, and fully synthetic speech was identified at below-chance levels. Trust cues showed no main effect on accuracy but motivated detection behavior. Quality ratings nonetheless tracked utterance type, which points to implicit discrimination and a dissociation between explicit judgments and implicit perception.

📝 Abstract

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.
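
As a rough illustration of what a "below-chance" detection claim involves, the sketch below runs a one-sided exact binomial test on a count of correct judgments. It is not the authors' analysis: the function names, the 0.5 chance level, and the counts are all assumptions made for illustration, and the appropriate chance level for a segment-localization task may well differ.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def below_chance_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value for the alternative 'accuracy < chance'."""
    return binom_cdf(correct, trials, chance)


# Hypothetical counts for illustration only; NOT data from the study.
correct, trials = 35, 100
p = below_chance_p_value(correct, trials)
print(f"accuracy={correct / trials:.2f}, one-sided p={p:.4f}")
```

A small p-value here would suggest that accuracy is systematically lower than guessing, rather than merely indistinguishable from it, which is the stronger reading of "below chance".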

Problem

Research questions and friction points this paper is trying to address.

synthetic speech

deepfake detection

human perception

trust cues

socio-technical environment

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic speech detection

socio-technical

trust cues