Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection

📅 2026-03-10

🤖 AI Summary

This work addresses the underuse, in speech deepfake detection, of the multi-granularity acoustic cues embedded in the residual vector quantization (RVQ) hierarchy of neural audio codecs. The authors propose a hierarchy-aware representation learning framework that explicitly models the contribution of each RVQ quantizer level and uses learnable global weights to fuse coarse structural information from early quantizers with the fine-grained synthesis artifacts captured by later ones. The approach is parameter-efficient: the speech encoder backbone stays frozen and only 4.4% additional parameters are trained. On ASVspoof 2019 and ASVspoof5, the method reduces the equal error rate (EER) by 46.2% and 13.9%, respectively, relative to strong baselines.
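The fusion step described above can be illustrated with a minimal sketch. It assumes the global weights are one learnable scalar per quantizer level, normalized with a softmax and shared across all utterances; the summary does not specify the parameterization, so the function names, the softmax choice, and the toy values are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(level_embeddings, level_logits):
    """Weighted sum of per-quantizer embeddings.

    level_embeddings: list of Q vectors (one per RVQ level), each of length D.
    level_logits: list of Q scalars that would be learned during training
                  (hypothetical parameterization of the "global" weights).
    Returns the fused D-dimensional vector and the normalized weights.
    """
    weights = softmax(level_logits)
    dim = len(level_embeddings[0])
    fused = [0.0] * dim
    for w, emb in zip(weights, level_embeddings):
        for d in range(dim):
            fused[d] += w * emb[d]
    return fused, weights


# Toy example: two quantizer levels with 2-dimensional embeddings.
coarse = [1.0, 0.0]   # stands in for an early (coarse) quantizer embedding
fine = [0.0, 1.0]     # stands in for a later (fine) quantizer embedding
fused, weights = fuse_levels([coarse, fine], [0.0, 0.0])
print(weights, fused)
```

With equal logits the weights are uniform and the fused vector is the plain average of the levels; training would move the logits so that levels carrying more forensic evidence receive larger weights.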

📝 Abstract

Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues, where early quantizers encode coarse structure and later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy. We propose a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. Keeping the speech encoder backbone frozen and updating only 4.4% additional parameters, our method achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof5 over strong baselines.
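To make the coarse-to-fine hierarchy concrete, the following is a minimal sketch of RVQ encoding with toy codebooks. It is a generic illustration of residual vector quantization, not the codec used in the paper; the codebook values, vector dimension, and function names are assumptions chosen for readability.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec with a stack of codebooks, each coding the previous residual.

    Returns the chosen index per level, the chosen code vector per level,
    and the residual left after the last level.
    """
    residual = list(vec)
    indices, per_level = [], []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        chosen = codebook[idx]
        indices.append(idx)
        per_level.append(list(chosen))
        # The next quantizer only sees what this level failed to explain.
        residual = [r - c for r, c in zip(residual, chosen)]
    return indices, per_level, residual


# Toy codebooks (hypothetical values): level 1 is coarse, level 2 refines.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
indices, per_level, residual = rvq_encode([1.25, 1.0], codebooks)
print(indices, per_level, residual)
```

For this input the first quantizer selects the coarse entry `[1.0, 1.0]` and the second encodes the remaining `[0.25, 0.0]`, so the indices are `[1, 0]` and the final residual is zero. The per-level code vectors are the kind of quantizer-level representation that a hierarchy-aware detector could weight separately.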

Problem

Research questions and friction points this paper is trying to address.

speech deepfake detection

neural audio codec

residual vector quantization

quantizer hierarchy

synthesis artifacts

Innovation

Methods, ideas, or system contributions that make the work stand out.

quantizer-aware

hierarchical neural codec

residual vector quantization