AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ

📅 2026-05-22

🤖 AI Summary
Neural speech codecs often lose emotion-related cues at low bitrates because of reconstruction-oriented bit allocation and cross-stream information leakage, which limits the affective capacity of downstream speech language models. This work proposes AffectCodec, built on block-diagonal residual finite scalar quantization (BD-RFSQ), which uses block-diagonal input and output projections to explicitly separate the emotion and acoustic subspaces. Combined with multi-granularity emotion conditioning and multi-rate training, AffectCodec preserves emotion structurally while keeping a flat token interface. The method shifts bit allocation from implicit, loss-driven optimization to explicit structural constraints; the reported experiments show substantially better emotion preservation, especially at low bitrates, with competitive acoustic quality and intelligibility.
📝 Abstract
Neural speech codecs have become the discrete interface between raw audio and speech language models, yet they remain optimized primarily for acoustic reconstruction fidelity, which leaves emotion-relevant cues vulnerable to being discarded during quantization, limiting the affective capacity of downstream models. We trace this degradation to two mechanisms: reconstruction-driven bit allocation under limited bitrate and cross-stream leakage in concatenation-based codecs, where acoustic gradients can overwrite nominally emotion-reserved dimensions. We propose AffectCodec, an emotion-preserving neural speech codec built on Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ). By imposing block-diagonal input and output projections over emotion and acoustic subspaces, BD-RFSQ transforms bit allocation from implicit and loss-driven to explicit and structurally guaranteed, while still preserving a flat token interface for downstream speech language models. AffectCodec further combines this structurally constrained quantizer with multi-granularity emotion conditioning and multi-rate training, enabling robust affect preservation at low bitrates. Experiments across multiple emotional speech benchmarks show that AffectCodec substantially improves emotion preservation, especially in the low-bitrate regime, while maintaining competitive acoustic quality and intelligibility. These results suggest that structurally protected quantization is an effective principle for preserving emotion-relevant information and may provide a general route toward attribute-aware neural speech compression.
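The structural idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions, the number of FSQ levels, the number of residual stages, the per-stage halving of the scale, the random weights, and the assumption that the input features are already split into an emotion part and an acoustic part are all hypothetical choices made for illustration. The sketch only shows why block-diagonal input and output projections, combined with per-dimension scalar quantization, keep acoustic inputs from affecting the emotion codes while still yielding one flat code array per frame.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
D_EMO, D_AC = 6, 10                      # input feature dims per subspace
K_EMO, K_AC = 3, 5                       # quantized dims per subspace
LEVELS = np.array([5] * (K_EMO + K_AC))  # odd FSQ level count per quantized dim
N_STAGES = 2                             # number of residual stages

def block_diag(a, b):
    """Place two matrices on the diagonal; off-diagonal blocks stay zero."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

# Random stand-ins for learned projections (assumption).
rng = np.random.default_rng(0)
W_IN = block_diag(rng.normal(size=(K_EMO, D_EMO)), rng.normal(size=(K_AC, D_AC)))
W_OUT = block_diag(rng.normal(size=(D_EMO, K_EMO)), rng.normal(size=(D_AC, K_AC)))

def fsq(z, levels):
    """Bound each dim with tanh, then round to one of `levels` uniform values."""
    half = (levels - 1) / 2.0
    codes = np.round(np.tanh(z) * half)   # integers in [-half, half]
    return codes / half, codes.astype(int)

def bd_rfsq(x):
    """x: (T, D_EMO + D_AC). Returns the reconstruction and codes (T, N_STAGES, K)."""
    z = x @ W_IN.T                        # block-diagonal input projection
    residual, recon, scale = z, np.zeros_like(z), 1.0
    stage_codes = []
    for _ in range(N_STAGES):
        q, codes = fsq(residual / scale, LEVELS)
        recon = recon + q * scale         # accumulate the quantized estimate
        residual = residual - q * scale   # pass what is left to the next stage
        stage_codes.append(codes)
        scale *= 0.5                      # hypothetical refinement schedule
    codes = np.stack(stage_codes, axis=1)
    return recon @ W_OUT.T, codes         # block-diagonal output projection

# The bit budget is fixed by structure: each subspace owns its own dimensions.
BITS_EMO = N_STAGES * np.log2(LEVELS[:K_EMO]).sum()
BITS_AC = N_STAGES * np.log2(LEVELS[K_EMO:]).sum()

x = rng.normal(size=(4, D_EMO + D_AC))
x_hat, codes = bd_rfsq(x)
flat_tokens = codes.reshape(len(x), -1)   # flat per-frame interface
```

Because the off-diagonal blocks of both projections are zero and FSQ rounds each dimension independently, changing only the acoustic half of `x` leaves `codes[..., :K_EMO]` and the emotion half of `x_hat` unchanged. In this sketch the emotion share of the bit budget (`BITS_EMO`) is set by `K_EMO` and `LEVELS` rather than by a loss term, which is one way to read the abstract's "explicit and structurally guaranteed" allocation.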
Problem

Research questions and friction points this paper is trying to address.

neural speech codec
emotion preservation
quantization
affective capacity
bit allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

emotion-preserving codec
block-diagonal quantization
neural speech compression
affective speech modeling
finite scalar quantization