Improving DF-Conformer Using Hydra For High-Fidelity Generative Speech Enhancement on Discrete Codec Token

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
DF-Conformer’s FAVOR+ approximate attention mechanism compromises global sequence modeling and struggles to balance accuracy with linear computational complexity. To address this, we propose Hydra-Genhancer: a generative speech enhancement framework that replaces FAVOR+ with a bidirectional selective structured state-space model (Hydra), eliminating the approximation error while preserving O(L) time complexity and improving long-range dependency modeling. Hydra is integrated into the Genhancer architecture to enable efficient, high-fidelity reconstruction over discrete codec token sequences. Experiments show consistent improvements over DF-Conformer on objective speech quality metrics (PESQ and STOI) and naturalness scores, particularly at low signal-to-noise ratios. Hydra-Genhancer thus offers a lightweight generative speech enhancement approach that combines exact linear-time global sequence modeling with perceptual fidelity.
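
To make the proposed replacement concrete, below is a minimal sketch of a bidirectional selective SSM mixer in the spirit of Hydra, written in PyTorch. The class names (SelectiveSSM, BidirectionalSSM), the diagonal state parameterization, and the combination rule (forward scan plus time-reversed scan plus a diagonal skip term) are illustrative assumptions rather than the paper's implementation; the actual Hydra layer is derived from a quasiseparable matrix mixer and uses a hardware-efficient parallel scan.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Causal selective scan with a diagonal state matrix (Mamba-style, simplified)."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))  # A = -exp(A_log) < 0
        self.B_proj = nn.Linear(d_model, d_state)   # input-dependent B_t (selectivity)
        self.C_proj = nn.Linear(d_model, d_state)   # input-dependent C_t
        self.dt_proj = nn.Linear(d_model, d_model)  # input-dependent step size

    def forward(self, x):                                  # x: (B, L, D)
        bsz, L, D = x.shape
        dt = F.softplus(self.dt_proj(x))                   # positive step sizes (B, L, D)
        A = -torch.exp(self.A_log)                         # stable negative poles (D, N)
        Bt, Ct = self.B_proj(x), self.C_proj(x)            # (B, L, N)
        h = x.new_zeros(bsz, D, self.A_log.shape[1])       # hidden state (B, D, N)
        ys = []
        for t in range(L):  # O(L) recurrence written out; real code uses a parallel scan
            dA = torch.exp(dt[:, t, :, None] * A)          # discretized decay (B, D, N)
            dB = dt[:, t, :, None] * Bt[:, t, None, :]     # discretized input  (B, D, N)
            h = dA * h + dB * x[:, t, :, None]
            ys.append((h * Ct[:, t, None, :]).sum(-1))     # readout -> (B, D)
        return torch.stack(ys, dim=1)                      # (B, L, D)

class BidirectionalSSM(nn.Module):
    """Non-causal mixer: forward scan + time-reversed scan + diagonal skip,
    keeping O(L) cost while letting every frame see the whole sequence."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.fwd = SelectiveSSM(d_model, d_state)
        self.bwd = SelectiveSSM(d_model, d_state)
        self.skip = nn.Parameter(torch.ones(d_model))      # diagonal of the mixer matrix

    def forward(self, x):                                  # x: (B, L, D)
        return self.fwd(x) + self.bwd(x.flip(1)).flip(1) + self.skip * x
```

The sequential loop spells out the O(L) recurrence explicitly; production implementations replace it with an associative parallel scan without changing the function being computed.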

📝 Abstract
The Dilated FAVOR Conformer (DF-Conformer) is an efficient variant of the Conformer architecture designed for speech enhancement (SE). It employs fast attention through positive orthogonal random features (FAVOR+) to mitigate the quadratic complexity associated with self-attention, while utilizing dilated convolution to expand the receptive field. This combination results in impressive performance across various SE models. In this paper, we propose replacing FAVOR+ with bidirectional selective structured state-space sequence models to achieve two main objectives: (1) enhancing global sequential modeling by eliminating the approximations inherent in FAVOR+, and (2) maintaining linear complexity relative to the sequence length. Specifically, we utilize Hydra, a bidirectional extension of Mamba, framed within the structured matrix mixer framework. Experiments conducted using a generative SE model on discrete codec tokens, known as Genhancer, demonstrate that the proposed method surpasses the performance of the DF-Conformer.
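
The abstract places this swap inside a Conformer-style block that also uses dilated convolution to widen the receptive field. The sketch below shows one plausible block layout, assuming the standard macaron Conformer recipe; the block name, feed-forward width, kernel size, and dilation are assumptions, and DF-Conformer's exact block may differ. Any (batch, length, channels)-to-(batch, length, channels) mixer, such as the BidirectionalSSM sketched above, can be dropped in where the FAVOR+ attention module used to sit.

```python
import torch
import torch.nn as nn

class DFConformerStyleBlock(nn.Module):
    """Conformer-style block: macaron FFNs + token mixer + dilated depthwise conv.
    `mixer` is any (B, L, D) -> (B, L, D) module, e.g. BidirectionalSSM."""
    def __init__(self, d_model: int, mixer: nn.Module,
                 kernel_size: int = 15, dilation: int = 2):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.mix_norm = nn.LayerNorm(d_model)
        self.mixer = mixer  # replaces FAVOR+ attention in the original DF-Conformer
        self.conv_norm = nn.LayerNorm(d_model)
        pad = dilation * (kernel_size - 1) // 2
        self.dw_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                 padding=pad, dilation=dilation, groups=d_model)
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (B, L, D)
        x = x + 0.5 * self.ffn1(x)
        x = x + self.mixer(self.mix_norm(x))     # linear-time global mixing
        y = self.conv_norm(x).transpose(1, 2)    # (B, D, L) for Conv1d
        x = x + self.dw_conv(y).transpose(1, 2)  # dilated local receptive field
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)
```

For example, `block = DFConformerStyleBlock(256, BidirectionalSSM(256))` yields a block whose cost grows linearly with sequence length, since every sub-module is either pointwise, a fixed-kernel convolution, or an O(L) scan.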
Problem

Research questions and friction points this paper is trying to address.

Enhancing global sequential modeling in speech enhancement
Maintaining linear complexity relative to sequence length
Improving generative speech enhancement on discrete codec tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaced FAVOR+ with bidirectional selective state-space models
Used Hydra extension of Mamba for global sequence modeling
Maintained linear complexity while improving speech enhancement performance (see the timing sketch below)
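
As a rough, hedged illustration of the linear-complexity point (an illustrative micro-benchmark of the sketches above, not a result from the paper): because the reference mixer is an O(L) scan, doubling the sequence length should roughly double its runtime, unlike the O(L^2) growth of exact self-attention.

```python
# Illustrative timing only; uses the BidirectionalSSM sketch defined earlier.
import time
import torch

mixer = BidirectionalSSM(d_model=64).eval()
for L in (256, 512, 1024):        # doubling L should roughly double the time
    x = torch.randn(1, L, 64)
    t0 = time.perf_counter()
    with torch.no_grad():
        mixer(x)
    print(f"L={L:5d}  {time.perf_counter() - t0:.3f}s")
```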
Shogo Seki (CyberAgent, Inc.; Acoustic signal processing)
Shaoxiang Dang (AI Lab, CyberAgent, Tokyo, Japan)
Li Li (AI Lab, CyberAgent, Tokyo, Japan)