🤖 AI Summary
This work addresses the unclear effectiveness of existing time-axis distillation-based alignment methods in speech variational autoencoders (VAEs) across three task categories: reconstruction, semantic understanding, and generation. For the first time, it systematically evaluates multiple distillation alignment strategies within a unified framework and proposes a joint-marginal alignment approach with an adaptive weighting mechanism, which achieves the best overall performance while enabling controllable trade-offs among tasks. Experimental results show that the proposed method attains superior overall performance across all three task categories and allows task-specific weights to be adjusted according to application requirements.
📝 Abstract
Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete-token-based features for speech generation and reconstruction. Recent research has tried to enrich the structural information in VAE latent representations by aligning them with self-supervised learning (SSL) features, aiming for better generation performance. However, it remains unclear whether the widely used alignment approach based on time-axis distillation is optimal when more tasks are considered. To address this problem, this paper systematically explores different alignment approaches and analyzes their impact on performance along three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting achieves the best overall performance while allowing a controllable balance among the three axes.