🤖 AI Summary
Existing neural speech codecs often lose emotion-related cues at low bitrates because of reconstruction-driven bit allocation and cross-stream information leakage, which limits the affective capacity of downstream models. This work proposes AffectCodec, built on block-diagonal residual finite scalar quantization (BD-RFSQ), a quantizer that explicitly decouples the emotion and acoustic subspaces through block-diagonal input and output projections. Combined with multi-granularity emotion conditioning and multi-rate training, AffectCodec preserves emotion in a structured way while keeping a flat token interface. The method shifts bit allocation from implicit, loss-driven optimization to explicit structural constraints, substantially improving emotion preservation at low bitrates while maintaining competitive acoustic quality and intelligibility.
📝 Abstract
Neural speech codecs have become the discrete interface between raw audio and speech language models, yet they remain optimized primarily for acoustic reconstruction fidelity. As a result, emotion-relevant cues are vulnerable to being discarded during quantization, which limits the affective capacity of downstream models. We trace this degradation to two mechanisms: reconstruction-driven bit allocation under limited bitrate and cross-stream leakage in concatenation-based codecs, where acoustic gradients can overwrite nominally emotion-reserved dimensions. We propose AffectCodec, an emotion-preserving neural speech codec built on Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ). By imposing block-diagonal input and output projections over emotion and acoustic subspaces, BD-RFSQ transforms bit allocation from implicit and loss-driven to explicit and structurally guaranteed, while still preserving a flat token interface for downstream speech language models. AffectCodec further combines this structurally constrained quantizer with multi-granularity emotion conditioning and multi-rate training, enabling robust affect preservation at low bitrates. Experiments across multiple emotional speech benchmarks show that AffectCodec substantially improves emotion preservation, especially in the low-bitrate regime, while maintaining competitive acoustic quality and intelligibility. These results suggest that structurally protected quantization is an effective principle for preserving emotion-relevant information and may provide a general route toward attribute-aware neural speech compression.
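The block-diagonal idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the dimensions, the random projection weights, the `tanh` bounding, the number of levels and residual stages, and all function names (`block_diag`, `rfsq_encode`, `rfsq_decode`) are assumptions chosen for illustration. It only shows how zeroed off-diagonal blocks keep the emotion codes independent of the acoustic features while the codes are still emitted as one flat token sequence per frame.

```python
import numpy as np

# Hypothetical sizes: encoder feature split and quantizer dims per subspace.
D_EMO_IN, D_AC_IN = 4, 12
D_EMO, D_AC = 2, 6


def block_diag(a, b):
    """Place two matrices on the diagonal; off-diagonal blocks stay zero."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out


def rfsq_encode(z, levels=5, n_stages=2):
    """Simplified residual finite scalar quantization (assumed form).

    Each dimension is bounded with tanh, then rounded to a grid; every
    later stage quantizes the remaining residual on a finer grid.
    """
    half = (levels - 1) // 2
    residual = np.tanh(z)
    codes = []
    for stage in range(n_stages):
        step = (1.0 / half) / (levels ** stage)
        c = np.clip(np.round(residual / step), -half, half)
        codes.append(c.astype(int))
        residual = residual - c * step
    return np.stack(codes)  # shape: (n_stages, frames, dims)


def rfsq_decode(codes, levels=5):
    """Sum the per-stage grid values back into a bounded latent."""
    half = (levels - 1) // 2
    steps = np.array([(1.0 / half) / (levels ** s) for s in range(codes.shape[0])])
    return np.tensordot(steps, codes, axes=1)  # shape: (frames, dims)


rng = np.random.default_rng(0)
# Block-diagonal input and output projections (random weights as stand-ins).
w_in = block_diag(rng.normal(size=(D_EMO, D_EMO_IN)),
                  rng.normal(size=(D_AC, D_AC_IN)))
w_out = block_diag(rng.normal(size=(D_EMO_IN, D_EMO)),
                   rng.normal(size=(D_AC_IN, D_AC)))

x = rng.normal(size=(3, D_EMO_IN + D_AC_IN))  # 3 frames of encoder features
x_perturbed = x.copy()
x_perturbed[:, D_EMO_IN:] += rng.normal(size=(3, D_AC_IN))  # acoustic part only

codes = rfsq_encode(x @ w_in.T)
codes_perturbed = rfsq_encode(x_perturbed @ w_in.T)

# Emotion codes occupy the first D_EMO dims and ignore the acoustic change.
emotion_codes_unchanged = np.array_equal(codes[..., :D_EMO],
                                         codes_perturbed[..., :D_EMO])

# Flat token interface: one vector of codes per frame, all stages concatenated.
tokens = codes.transpose(1, 0, 2).reshape(x.shape[0], -1)
x_hat = rfsq_decode(codes) @ w_out.T  # projected back to feature space
```

Because the off-diagonal blocks of `w_in` and `w_out` are fixed at zero, no acoustic feature can reach an emotion dimension in the forward pass, and by the same argument no reconstruction gradient from the acoustic block could flow into the emotion block's weights in a trained version. The split of bits between the two subspaces is then set by `D_EMO` and `D_AC` rather than by the loss.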