🤖 AI Summary
Existing personalized speech enhancement methods rely heavily on pre-trained speaker encoders, suffer from architectural complexity, and underutilize registered speaker utterances. To address these limitations, this paper proposes an end-to-end encoder-free framework. Our method jointly models registered clean speech and noisy mixtures within a deep encoder-decoder architecture—eliminating external speaker encoders entirely. Key contributions include: (1) an Interactive Speaker Adaptation (ISA) module that dynamically establishes fine-grained cross-signal correlations between registered speech and noisy mixtures via feature modulation; and (2) a Channel-attention-driven Local–Global Context Aggregation (LCA) module that enhances contextual awareness in time-frequency representations. Evaluated on Libri2Mix, our approach achieves state-of-the-art performance, improving PESQ by over 1.2 points relative to strong encoder-based baselines—demonstrating both effectiveness and generalization capability.
📝 Abstract
Personalized speech enhancement (PSE) methods typically rely on pre-trained speaker verification models or self-designed speaker encoders to extract target speaker clues, guiding the PSE model in isolating the desired speech. However, these approaches suffer from significant model complexity and often underutilize enrollment speaker information, limiting the potential performance of the PSE model. To address these limitations, we propose a novel Speaker Encoder-Free PSE network, termed SEF-PNet, which fully exploits the information present in both the enrollment speech and noisy mixtures. SEF-PNet incorporates two key innovations: Interactive Speaker Adaptation (ISA) and Local-Global Context Aggregation (LCA). ISA dynamically modulates the interactions between enrollment and noisy signals to enhance the speaker adaptation, while LCA employs advanced channel attention within the PSE encoder to effectively integrate local and global contextual information, thus improving feature learning. Experiments on the Libri2Mix dataset demonstrate that SEF-PNet significantly outperforms baseline models, achieving state-of-the-art PSE performance.