🤖 AI Summary
Diffusion-based speech enhancement achieves high naturalness and generalization but suffers from prominent generation artifacts and high inference latency. To address these limitations, this paper proposes an artifact-aware semantic-consistency ensembled diffusion framework. First, it introduces phoneme-level artifact prediction via variance estimation of speech embeddings. Second, it designs a semantic-consistency-guided multi-path diffusion ensemble mechanism that fuses multi-step denoising outputs. Third, it incorporates an adaptive diffusion step scheduler that dynamically balances artifact suppression and inference efficiency. Evaluated under low signal-to-noise ratio conditions, the method reduces word error rate by 15%, significantly improves phoneme accuracy and semantic plausibility, and decreases average inference latency by 32%. This work establishes a new paradigm for high-fidelity, low-latency diffusion-based speech enhancement.
📝 Abstract
Diffusion-based speech enhancement (SE) achieves natural-sounding speech and strong generalization, yet suffers from key limitations like generative artifacts and high inference latency. In this work, we systematically study artifact prediction and reduction in diffusion-based SE. We show that variance in speech embeddings can be used to predict phonetic errors during inference. Building on these findings, we propose an ensemble inference method guided by semantic consistency across multiple diffusion runs. This technique reduces WER by 15% in low-SNR conditions, effectively improving phonetic accuracy and semantic plausibility. Finally, we analyze the effect of the number of diffusion steps, showing that adaptive diffusion steps balance artifact suppression and latency. Our findings highlight semantic priors as a powerful tool to guide generative SE toward artifact-free outputs.