Harmonic-Percussive Disentangled Neural Audio Codec for Bandwidth Extension

📅 2025-11-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Bandwidth extension (BWE) aims to reconstruct high-frequency components from low-pass audio, a classic yet challenging audio generation task. This paper proposes a decoupled neural codec framework tailored for generative modeling, which—uniquely—integrates harmonic–percussive source separation (HPSS) into an end-to-end audio codec to explicitly disentangle and jointly optimize harmonic and percussive features. The method combines discrete audio tokenization, a Transformer-based language model for autoregressive high-frequency token prediction, and a joint training strategy that enhances representation learning and reconstruction consistency. Evaluated on multiple benchmark datasets, our approach achieves significant improvements over state-of-the-art methods in both objective metrics (PESQ, STOI) and subjective listening tests (MOS), demonstrating superior performance in high-fidelity high-frequency reconstruction and perceptual quality preservation.

Technology Category

Application Category

📝 Abstract
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks, In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing high-frequency audio from low-pass signals
Framing bandwidth extension as audio token prediction
Designing disentangled neural codec guided by harmonic-percussive decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based language model for audio token prediction
Disentangled neural codec with Harmonic-Percussive decomposition
Joint codec-transformer design for enhanced bandwidth extension
🔎 Similar Papers
No similar papers found.