🤖 AI Summary
Existing closed-set speaker identification methods offer limited mechanisms for integrating temporal information across multiple time scales. To address this limitation, this work proposes TARNet, a lightweight Temporal-Aware Representation Network that explicitly models short-, mid-, and long-term temporal dependencies through a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated by an Attentive Statistics Pooling (ASP) module into a discriminative utterance-level speaker embedding. On the VoxCeleb1 and LibriSpeech datasets, TARNet is reported to outperform state-of-the-art methods while maintaining competitive computational complexity, offering a favorable balance between accuracy and practicality.
📝 Abstract
Closed-set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide only limited mechanisms for modeling such multi-scale temporal dependencies, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP) module to produce a discriminative utterance-level speaker embedding. Experiments on the VoxCeleb1 and LibriSpeech datasets show that TARNet outperforms state-of-the-art methods while maintaining competitive computational complexity, making it suitable for practical speaker identification systems. The code is publicly available at https://github.com/YassinTERRAF/TARNet.
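To make the two ingredients named in the abstract concrete, the following is a minimal, standard-library-only sketch of dilated temporal filtering at several scales followed by attentive statistics pooling. It is not the TARNet implementation. The function names, the kernel, the dilation values, the use of parallel branches fused by per-frame concatenation, and the hand-supplied attention scores are all assumptions made for illustration; in an actual ASP module the scores would come from a small learned network.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """Same-length 1-D dilated convolution over a list of floats (zero padded)."""
    center = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - center) * dilation  # taps are spaced `dilation` frames apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def encode_multiscale(x, kernel, stage_dilations):
    """Hypothetical fusion: one branch per dilation, concatenated per frame (T x S)."""
    stages = [dilated_conv1d(x, kernel, d) for d in stage_dilations]
    return [list(frame) for frame in zip(*stages)]


def attentive_stats_pooling(frames, scores):
    """Pool T x D frame features into a 2*D vector [weighted mean, weighted std].

    `scores` holds one raw attention score per frame (assumed given here).
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    alpha = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alpha, frames)) for d in range(dim)]
    var = [
        sum(a * f[d] ** 2 for a, f in zip(alpha, frames)) - mean[d] ** 2
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 1e-12)) for v in var]  # clamp guards against round-off
    return mean + std


# Toy usage: a single-channel "utterance" of 8 frames, three assumed time scales.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
features = encode_multiscale(signal, kernel=[1.0, 1.0, 1.0], stage_dilations=[1, 2, 4])
embedding = attentive_stats_pooling(features, scores=[0.0] * len(features))
```

With uniform scores, the pooling reduces to the ordinary mean and standard deviation over time; non-uniform scores let frames judged more informative contribute more to the utterance-level embedding. The `receptive_field` helper shows why larger dilations cover longer temporal context without adding parameters.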