Towards Neural Audio Codec Source Parsing

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing audio deepfake (codecfake) detection methods have two critical limitations in neural audio codec (NAC) provenance attribution: they generalize poorly to unseen NACs, and they cannot infer fine-grained internal configurations (e.g., quantizer design, bandwidth, sampling rate). Method: The paper models NAC provenance as a structured parameter regression task. It proposes NACSP, a paradigm for interpretable configuration prediction on unknown NACs, and HYDRA, a framework that combines multi-curvature hyperbolic subspace modeling with task-aware attention to disentangle latent representations. Contribution/Results: Using features from pretrained speech models, HYDRA outperforms Euclidean baselines on codecfake benchmark datasets. The authors present it as the first approach to fine-grained, interpretable, and generalizable NAC attribution, identifying both codec identity and internal configuration parameters for previously unseen NACs.

📝 Abstract
A new class of audio deepfakes, codecfakes (CFs), has recently attracted attention. CFs are synthesized by Audio Language Models that use neural audio codecs (NACs) in the backend. In response, the community has introduced dedicated benchmarks and tailored detection strategies. As the field advances, efforts have moved beyond binary detection toward source attribution, including open-set attribution, which aims to identify the NAC responsible for generation and to flag novel, unseen NACs during inference. This shift toward source attribution improves forensic interpretability and accountability. However, open-set attribution remains fundamentally limited: it can detect that a NAC is unfamiliar, but it cannot characterize or identify individual unseen codecs. It treats such inputs as generic "unknowns" and offers no insight into their internal configuration. This leads to two major shortcomings: limited generalization to new NACs and an inability to resolve fine-grained variations within NAC families. To address these gaps, we propose Neural Audio Codec Source Parsing (NACSP), a paradigm shift that reframes source attribution for CFs as structured regression over generative NAC parameters such as quantizers, bandwidth, and sampling rate. We formulate NACSP as a multi-task regression problem for predicting these NAC parameters and establish the first comprehensive benchmark using various state-of-the-art speech pre-trained models (PTMs). To this end, we propose HYDRA, a novel framework that leverages hyperbolic geometry to disentangle complex latent properties from PTM representations. By employing task-specific attention over multiple curvature-aware hyperbolic subspaces, HYDRA enables superior multi-task generalization. Our extensive experiments show that HYDRA achieves top results on benchmark CF datasets compared to baselines operating in Euclidean space.
Problem

Research questions and friction points this paper is trying to address.

Detect and attribute neural audio codecs in deepfakes
Overcome limitations of open-set attribution for unknown codecs
Predict generative NAC parameters via structured regression
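
To make the "structured regression" framing concrete, here is a minimal sketch of a multi-task regression objective over the three parameters named in the abstract (quantizers, bandwidth, sampling rate). The names `NACConfig`, `TASKS`, and `multitask_mse`, the units, and the unweighted mean over tasks are all assumptions made for illustration; the paper's actual loss, target scaling, and task weighting are not stated in this summary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NACConfig:
    """Hypothetical regression target: one number per codec parameter."""
    quantizers: float
    bandwidth_kbps: float      # unit is an assumption
    sampling_rate_khz: float   # unit is an assumption

# One regression task per codec parameter.
TASKS = ("quantizers", "bandwidth_kbps", "sampling_rate_khz")

def multitask_mse(preds, targets):
    """Return (per-task MSE dict, unweighted mean of the per-task MSEs)."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    per_task = {}
    for name in TASKS:
        sq_errs = [(getattr(p, name) - getattr(t, name)) ** 2
                   for p, t in zip(preds, targets)]
        per_task[name] = sum(sq_errs) / len(sq_errs)
    total = sum(per_task.values()) / len(TASKS)
    return per_task, total

# Toy usage: the second prediction gets the number of quantizers wrong by 2.
preds = [NACConfig(8, 6.0, 24.0), NACConfig(4, 3.0, 16.0)]
targets = [NACConfig(8, 6.0, 24.0), NACConfig(2, 3.0, 16.0)]
per_task, total = multitask_mse(preds, targets)
```

Because each parameter is a separate regression target, an unseen codec can in principle still be described by its predicted configuration instead of being collapsed into a single "unknown" class, which is the gap the abstract attributes to open-set attribution.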
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured regression over NAC parameters
Hyperbolic geometry for latent disentanglement
Multi-task attention in hyperbolic subspaces
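
The hyperbolic ingredients listed above can be illustrated with a small sketch. It uses the standard exponential and logarithmic maps at the origin of a Poincaré ball with curvature -c, then combines several subspaces of different curvature with softmax weights. The function names, the aggregation in tangent space, and the use of fixed `task_scores` are assumptions for illustration only; this is not HYDRA's actual architecture, which this summary does not specify.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def expmap0(v, c):
    """Map a tangent vector at the origin onto the Poincare ball (curvature -c)."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    s = math.sqrt(c)
    scale = math.tanh(s * n) / (s * n)
    return [scale * x for x in v]

def logmap0(y, c):
    """Inverse of expmap0: map a ball point back to the tangent space at the origin."""
    n = norm(y)
    if n == 0.0:
        return list(y)
    s = math.sqrt(c)
    arg = min(s * n, 1.0 - 1e-12)  # keep atanh inside its domain
    scale = math.atanh(arg) / (s * n)
    return [scale * x for x in y]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(points, curvatures, task_scores):
    """Softmax-weighted sum of the subspace points, computed in tangent space."""
    weights = softmax(task_scores)
    tangents = [logmap0(p, c) for p, c in zip(points, curvatures)]
    dim = len(tangents[0])
    return [sum(w * t[i] for w, t in zip(weights, tangents)) for i in range(dim)]

# Toy usage: one feature vector embedded in three balls of different curvature.
v = [0.3, -0.4]
curvatures = [0.5, 1.0, 2.0]
points = [expmap0(v, c) for c in curvatures]
fused = attend(points, curvatures, task_scores=[0.0, 0.0, 0.0])
```

In a multi-task setting, each task would presumably hold its own scores over the subspaces, so that, for example, bandwidth and sampling rate could draw on differently curved representations. In this toy example every subspace holds the same vector, so the fused result simply recovers `v`.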