🤖 AI Summary
This work addresses human-imitated speech, whose high naturalness lets it evade detection by conventional acoustic features and thereby threatens voice authentication systems. Inspired by human auditory mechanisms, the study proposes a spectro-temporal modulation (STM) representation built on two cochlear filterbank models, Gammatone and Gammachirp, and introduces a Segmental-STM variant that captures short-term speech dynamics with high temporal resolution. This approach represents the first application of auditory-inspired STM features to imitated speech detection. Evaluated on public datasets, the STM representation reaches detection accuracy comparable to that of human listeners, and the Segmental-STM variant surpasses human perceptual performance, improving the anti-spoofing robustness of voice authentication.
📝 Abstract
Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. AI-generated speech often contains artifacts, over-smoothed spectra, or robotic cues, whereas imitated speech is produced naturally by humans and therefore preserves a higher degree of naturalness. This makes imitation-based speech forgery significantly harder to detect with conventional acoustic or cepstral features. To overcome this challenge, this study proposes an auditory perception-based Spectro-Temporal Modulation (STM) representation framework for human-imitated speech detection. The STM representations are derived from two cochlear filterbank models: the Gammatone Filterbank (GTFB), which simulates frequency selectivity and can be regarded as a first approximation of cochlear filtering, and the Gammachirp Filterbank (GCFB), which models both frequency selectivity and level-dependent asymmetry. These STM representations jointly capture the temporal and spectral fluctuations of speech signals, that is, changes over time in the spectrogram and variations along the frequency axis, both of which are relevant to human auditory perception. We also introduce a Segmental-STM representation that analyzes short-term modulation patterns across overlapping time windows, enabling high-resolution modeling of temporal speech variations. Experimental results show that STM representations are effective for human-imitated speech detection, achieving accuracy close to that of human listeners, and that Segmental-STM representations are more effective still, surpassing human perceptual performance. The findings demonstrate that perceptually inspired spectro-temporal modeling is promising for detecting imitation-based speech attacks and improving voice authentication robustness.
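To make the pipeline described in the abstract concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. It builds an auditory spectrogram from a simple FIR Gammatone filterbank, takes a 2D Fourier transform to obtain an STM representation, and repeats that over overlapping time windows for a Segmental-STM. Everything specific here is an assumption made for illustration: the function names, the filter order of 4 and bandwidth factor 1.019, the log-spaced center frequencies, the 10 ms frame hop, the use of a plain 2D FFT as the modulation analysis, and the window sizes. The Gammachirp filterbank is not modeled.

```python
import numpy as np


def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def auditory_spectrogram(x, fs, center_freqs, order=4, b=1.019,
                         ir_dur=0.025, hop=0.010):
    """Log envelope per Gammatone channel, shape (channels, frames).

    Assumption: complex FIR Gammatone kernels; the magnitude of the
    filtered signal is used as the channel envelope.
    """
    t = np.arange(int(round(ir_dur * fs))) / fs
    hop_n = int(round(hop * fs))
    n_frames = len(x) // hop_n
    rows = []
    for fc in center_freqs:
        # Gammatone impulse response: t^(n-1) * exp(-2*pi*b*ERB*t) * carrier
        g = (t ** (order - 1)
             * np.exp(-2 * np.pi * b * erb(fc) * t)
             * np.exp(2j * np.pi * fc * t))
        g = g / np.sum(np.abs(g))
        env = np.abs(np.convolve(x, g, mode="same"))
        # Average the envelope inside each non-overlapping frame
        frames = env[: n_frames * hop_n].reshape(n_frames, hop_n).mean(axis=1)
        rows.append(frames)
    return np.log(np.asarray(rows) + 1e-8)


def stm(spec):
    """Magnitude of the 2D FFT of a mean-removed spectrogram.

    Axis 0 indexes spectral modulation, axis 1 temporal modulation;
    zero modulation is shifted to the center of each axis.
    """
    spec = spec - spec.mean()
    return np.abs(np.fft.fftshift(np.fft.fft2(spec)))


def segmental_stm(spec, win=50, hop=25):
    """STM computed on overlapping windows of `win` frames, `hop` apart."""
    starts = range(0, spec.shape[1] - win + 1, hop)
    return np.stack([stm(spec[:, s:s + win]) for s in starts])
```

For a one-second signal at 16 kHz with 16 channels, `auditory_spectrogram` returns a 16 by 100 array, `stm` returns an array of the same shape, and `segmental_stm` with the default window of 50 frames and hop of 25 returns three 16 by 50 modulation maps that could be fed to a classifier. The choice of classifier and of the actual analysis parameters is not stated in the abstract and is left open here.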