LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction

📅 2024-10-09
🏛️ IEEE Transactions on Audio, Speech, and Language Processing
📈 Citations: 2
Influential: 0

🤖 AI Summary
To address dynamic speaker counts (up to eight), real-time processing, and long-context modeling in streaming diarization of hour-long audio, this paper proposes LS-EEND, a frame-wise online end-to-end neural diarization model built from a causal embedding encoder and an online attractor decoder. The method has three key components: (1) an online attractor decoder that generates frame-wise attractors for new speakers and updates them for existing speakers; (2) self-attention in the decoder along both the time and speaker dimensions, so that inter-frame and inter-speaker dependencies are modeled jointly; and (3) a retention mechanism with linear temporal complexity, combined with a multi-step progressive training strategy that moves from easy to hard tasks in terms of speaker count and audio length. Evaluated without oracle speech activity information on CALLHOME, DIHARD II, DIHARD III, and AMI, the model achieves state-of-the-art online diarization error rate (DER) on all four datasets (12.11% on CALLHOME), with a real-time factor several times lower than that of the comparison online diarization models.

📝 Abstract
This work proposes a frame-wise online/streaming end-to-end neural diarization (EEND) method, which detects speaker activities in a frame-in-frame-out fashion. The proposed model mainly consists of a causal embedding encoder and an online attractor decoder. Speakers are modelled in the self-attention-based decoder along both the time and speaker dimensions, and frame-wise speaker attractors are automatically generated for new speakers and updated for existing speakers. A retention mechanism is employed and specially adapted for long-form diarization, giving linear temporal complexity. A multi-step progressive training strategy is proposed for gradually learning from easy tasks to hard tasks in terms of the number of speakers and audio length. Finally, the proposed model (referred to as long-form streaming EEND, LS-EEND) is able to perform streaming diarization for a high (up to 8) and flexible number of speakers and for very long (e.g., one hour) audio recordings. Experiments on various simulated and real-world datasets show that: 1) when not using oracle speech activity information, the proposed model achieves new state-of-the-art online diarization error rates on all datasets, including CALLHOME (12.11%), DIHARD II (27.58%), DIHARD III (19.61%), and AMI (20.76%); 2) due to the frame-in-frame-out processing and the linear temporal complexity, the proposed model achieves a several times lower real-time factor than the comparison online diarization models.
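
To make the "linear temporal complexity" claim concrete, the following is a minimal sketch of the generic retention recurrence used in retentive-network-style layers. It is not the paper's adapted variant: the function names, the decay value `gamma`, and the toy inputs are all illustrative assumptions. The recurrent form keeps a fixed-size state and does constant work per incoming frame, which is what makes frame-in-frame-out processing of very long audio feasible; the second function is the equivalent quadratic form, included only as a check.

```python
# Generic retention recurrence (retentive-network style), shown for illustration.
# This is NOT the adapted retention from the paper; names, gamma, and the toy
# inputs below are assumptions made for this sketch.

def retention_recurrent(q, k, v, gamma):
    """Process frames one at a time with a fixed-size state (constant work per frame)."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # decayed running sum of k^T v
    outputs = []
    for q_n, k_n, v_n in zip(q, k, v):
        # S_n = gamma * S_{n-1} + k_n^T v_n
        state = [[gamma * state[i][j] + k_n[i] * v_n[j] for j in range(d_v)]
                 for i in range(d_k)]
        # o_n = q_n S_n
        outputs.append([sum(q_n[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def retention_parallel(q, k, v, gamma):
    """Reference form: causal, decay-weighted sum over all past frames (quadratic in length)."""
    d_v = len(v[0])
    outputs = []
    for n in range(len(q)):
        o_n = [0.0] * d_v
        for m in range(n + 1):  # causal: only frames m <= n contribute
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                o_n[j] += score * v[m][j]
        outputs.append(o_n)
    return outputs


# Toy example: 3 frames, key/query dimension 2, value dimension 1.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
streamed = retention_recurrent(q, k, v, gamma=0.9)
batched = retention_parallel(q, k, v, gamma=0.9)
```

Both functions produce the same outputs up to floating-point error, but only the recurrent one has a per-frame cost that is independent of how much audio has already been processed.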
Problem

Research questions and friction points this paper is trying to address.

Online speaker diarization for long audio recordings
Handling flexible speaker counts up to eight
Reducing computational complexity with linear temporal scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online attractor decoder for speaker modeling
Linear complexity retention mechanism for long audio
Progressive training strategy from easy to hard tasks