🤖 AI Summary
This study addresses a key limitation of existing deepfake detection methods: operating at a 16 kHz sampling rate, they discard high-frequency information and therefore struggle to identify forgery traces in high-fidelity synthetic singing voices. To overcome this, the work presents the first systematic investigation into the role of high-resolution (44.1 kHz) audio for singing voice deepfake detection and proposes a novel framework that jointly models a full-band branch and multiple sub-band expert branches. The full-band branch captures global contextual cues, while the sub-band experts specifically extract localized high-frequency artifacts that are unevenly distributed across the spectrum. Evaluated on the WildSVDD dataset, the proposed method significantly outperforms current 16 kHz approaches, demonstrating that high-sampling-rate audio combined with a sub-band fusion strategy enhances detection performance in real-world scenarios.
📝 Abstract
Rapid advances in singing voice synthesis have increased the risk of unauthorized imitation, creating an urgent need for better Singing Voice Deepfake Detection (SVDD), also known as SingFake detection. Unlike speech, singing exhibits complex pitch contours, a wide dynamic range, and rich timbral variation. Conventional detectors operating on 16 kHz audio prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint full-band/sub-band modeling framework: the full-band branch captures global context, while sub-band-specific experts isolate fine-grained synthesis artifacts that are unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency sub-bands provide essential complementary cues. Our framework significantly outperforms models trained on 16 kHz audio, showing that high-resolution input and strategic sub-band integration are critical for robust in-the-wild detection.
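To make the joint full-band/sub-band idea concrete, here is a minimal pure-Python sketch. The paper's actual branches are learned neural experts; the `energy`-based scoring functions, the equal-width band split, and the 50/50 fusion weighting below are all illustrative assumptions, not the authors' implementation.

```python
def split_subbands(spectrum, n_bands):
    """Partition a magnitude spectrum's frequency bins into contiguous sub-bands."""
    size = len(spectrum) // n_bands
    return [spectrum[i * size:(i + 1) * size] for i in range(n_bands)]

def energy(bins):
    """Mean squared magnitude: a toy stand-in for a learned branch's score."""
    return sum(x * x for x in bins) / max(len(bins), 1)

def fused_score(spectrum, n_bands=4, band_weights=None):
    """Fuse one global (full-band) cue with localized per-sub-band cues.

    The high-frequency sub-bands get their own experts so that artifacts
    concentrated there are not averaged away by the global view.
    """
    bands = split_subbands(spectrum, n_bands)
    full_cue = energy(spectrum)                      # global context
    sub_cues = [energy(b) for b in bands]            # localized artifacts
    if band_weights is None:
        band_weights = [1.0 / n_bands] * n_bands     # uniform fusion (assumed)
    return 0.5 * full_cue + 0.5 * sum(w * s for w, s in zip(band_weights, sub_cues))
```

A real system would replace `energy` with per-band classifier branches and learn the fusion weights jointly, but the structure (one global path plus several band-restricted paths whose outputs are combined) mirrors the framework described above.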