Dual-Strategy-Enhanced ConBiMamba for Neural Speaker Diarization

📅 2026-01-27
🤖 AI Summary
This work proposes Dual-Strategy-Enhanced ConBiMamba, a novel system for speaker diarization that addresses the limitations of existing approaches in balancing fine-grained local details with long-range speaker consistency, while also mitigating high memory consumption and inaccurate boundary detection. For the first time, the ConBiMamba architecture is introduced to speaker diarization, effectively combining Conformer’s strong local modeling capacity with Mamba’s efficiency in handling long sequences. The method incorporates a boundary-enhanced transition loss to precisely localize speaker change points and employs a hierarchical feature aggregation mechanism to improve the utilization of multi-layer representations. Extensive experiments on six standard datasets demonstrate that the proposed approach achieves state-of-the-art performance on four of them, significantly reducing the Diarization Error Rate (DER) in speaker transition regions.
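The summary mentions a hierarchical (layer-wise) feature aggregation mechanism but not its exact form. A common design for this kind of mechanism is a learnable softmax-weighted sum over the outputs of all encoder layers; the sketch below illustrates that idea. The function names (`layer_wise_aggregate`, `softmax`) and the softmax-weighting choice are assumptions for illustration, not the paper's verified formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of layer scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_wise_aggregate(layer_outputs, layer_logits):
    """Combine per-layer encoder representations into one sequence.

    layer_outputs: list of (T, D) arrays, one per encoder layer.
    layer_logits:  (L,) learnable scores, softmax-normalised to weights.
    """
    w = softmax(np.asarray(layer_logits, dtype=float))
    return sum(wi * h for wi, h in zip(w, layer_outputs))

# Usage: 3 layers, 4 frames, 2 feature dims; equal logits give equal weights,
# so the result is the mean of the per-layer values (0, 1, 2 -> 1.0).
layers = [np.full((4, 2), float(i)) for i in range(3)]
agg = layer_wise_aggregate(layers, [0.0, 0.0, 0.0])
```

In practice the layer scores would be trained jointly with the rest of the network, letting the model decide which encoder depths matter most for diarization.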

📝 Abstract
Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns, while Conformer's self-attention incurs high memory overhead on long speech sequences and can be unstable when modeling long-range dependencies. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we apply ConBiMamba to speaker diarization for the first time. Following the Pyannote pipeline, we propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba integrates the strengths of Conformer and Mamba: Conformer's convolutional and feed-forward structures improve local feature extraction, while replacing Conformer's self-attention with ExtBiMamba lets the model handle long audio sequences efficiently and alleviates the high memory cost of self-attention. Furthermore, to address the elevated Diarization Error Rate (DER) around speaker change points, we introduce the Boundary-Enhanced Transition Loss to sharpen the detection of speaker change points. We also propose Layer-wise Feature Aggregation to improve the utilization of multi-layer representations. The system is evaluated on six diarization datasets and achieves state-of-the-art performance on four of them. The source code of our study is available at https://github.com/lz-hust/DSE-CBM.
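The abstract describes a Boundary-Enhanced Transition Loss that targets the elevated DER around speaker change points, without giving its formula. One plausible reading is a per-frame loss whose weights are boosted near annotated change points; the sketch below shows that idea on a simplified single-speaker activity track. The helper names (`boundary_weights`, `weighted_bce`), the `radius`/`boost` parameters, and the binary-cross-entropy base loss are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def boundary_weights(labels, radius=2, boost=3.0):
    """Per-frame weights that up-weight frames within `radius`
    of a speaker change point in a (T,) label sequence.
    NOTE: illustrative; real diarization labels are multi-speaker."""
    labels = np.asarray(labels)
    # Indices where the label differs from the previous frame.
    changes = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    w = np.ones(len(labels))
    for c in changes:
        lo, hi = max(0, c - radius), min(len(labels), c + radius + 1)
        w[lo:hi] = boost
    return w

def weighted_bce(probs, targets, weights, eps=1e-9):
    """Binary cross-entropy averaged with per-frame weights."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    t = np.asarray(targets, dtype=float)
    per_frame = -(t * np.log(probs) + (1 - t) * np.log(1 - probs))
    return float((weights * per_frame).sum() / weights.sum())
```

With this weighting, errors in transition regions contribute more to the training objective than errors in stable speech regions, which is one way to push a model toward more precise change-point localization.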
Problem

Research questions and friction points this paper is trying to address.

- speaker diarization
- Conformer
- Mamba
- long-range dependency
- local feature modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

- ConBiMamba
- ExtBiMamba
- Boundary-Enhanced Transition Loss
- Layer-wise Feature Aggregation
- speaker diarization
Zhen Liao
School of Electronic Information and Communications, Hubei Provincial Key Laboratory of Smart Internet Technology, Huazhong University of Science and Technology, China
Gaole Dai
PhD Candidate, Peking University
Mengqiao Chen
School of Electronic Information and Communications, Hubei Provincial Key Laboratory of Smart Internet Technology, Huazhong University of Science and Technology, China
Wenqing Cheng
School of Electronic Information and Communications, Hubei Provincial Key Laboratory of Smart Internet Technology, Huazhong University of Science and Technology, China
Wei Xu
University of Science and Technology of China