Dual-Strategy-Enhanced ConBiMamba for Neural Speaker Diarization

📅 2026-01-27
🤖 AI Summary
This work proposes Dual-Strategy-Enhanced ConBiMamba, a novel system for speaker diarization that addresses the limitations of existing approaches in balancing fine-grained local details with long-range speaker consistency, while also mitigating high memory consumption and inaccurate boundary detection. For the first time, the ConBiMamba architecture is introduced to speaker diarization, effectively combining Conformer’s strong local modeling capacity with Mamba’s efficiency in handling long sequences. The method incorporates a boundary-enhanced transition loss to precisely localize speaker change points and employs a hierarchical feature aggregation mechanism to improve the utilization of multi-layer representations. Extensive experiments on six standard datasets demonstrate that the proposed approach achieves state-of-the-art performance on four of them, significantly reducing the Diarization Error Rate (DER) in speaker transition regions.
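The summary mentions a hierarchical (layer-wise) feature aggregation mechanism but not its exact form. A common design for this kind of mechanism is a learnable softmax-weighted sum over the outputs of all encoder layers; the sketch below illustrates that idea. The function names (`layer_wise_aggregate`, `softmax`) and the softmax-weighting choice are assumptions for illustration, not the paper's verified formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of layer scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_wise_aggregate(layer_outputs, layer_logits):
    """Combine per-layer encoder representations into one sequence.

    layer_outputs: list of (T, D) arrays, one per encoder layer.
    layer_logits:  (L,) learnable scores, softmax-normalised to weights.
    """
    w = softmax(np.asarray(layer_logits, dtype=float))
    return sum(wi * h for wi, h in zip(w, layer_outputs))

# Usage: 3 layers, 4 frames, 2 feature dims; equal logits give equal weights,
# so the result is the mean of the per-layer values (0, 1, 2 -> 1.0).
layers = [np.full((4, 2), float(i)) for i in range(3)]
agg = layer_wise_aggregate(layers, [0.0, 0.0, 0.0])
```

In practice the layer scores would be trained jointly with the rest of the network, letting the model decide which encoder depths matter most for diarization.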

📝 Abstract
Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns, while Conformer's self-attention incurs high memory overhead on long speech sequences and can be unstable when modeling long-range dependencies. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we apply ConBiMamba to speaker diarization for the first time. Following the Pyannote pipeline, we propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba integrates the strengths of Conformer and Mamba: Conformer's convolutional and feed-forward structures improve local feature extraction, while replacing Conformer's self-attention with ExtBiMamba lets the model handle long audio sequences efficiently and alleviates the high memory cost of self-attention. Furthermore, to address the elevated Diarization Error Rate (DER) around speaker change points, we introduce the Boundary-Enhanced Transition Loss to sharpen the detection of speaker change points. We also propose Layer-wise Feature Aggregation to improve the utilization of multi-layer representations. The system is evaluated on six diarization datasets and achieves state-of-the-art performance on four of them. The source code of our study is available at https://github.com/lz-hust/DSE-CBM.
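The abstract describes a Boundary-Enhanced Transition Loss that targets the elevated DER around speaker change points, without giving its formula. One plausible reading is a per-frame loss whose weights are boosted near annotated change points; the sketch below shows that idea on a simplified single-speaker activity track. The helper names (`boundary_weights`, `weighted_bce`), the `radius`/`boost` parameters, and the binary-cross-entropy base loss are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def boundary_weights(labels, radius=2, boost=3.0):
    """Per-frame weights that up-weight frames within `radius`
    of a speaker change point in a (T,) label sequence.
    NOTE: illustrative; real diarization labels are multi-speaker."""
    labels = np.asarray(labels)
    # Indices where the label differs from the previous frame.
    changes = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    w = np.ones(len(labels))
    for c in changes:
        lo, hi = max(0, c - radius), min(len(labels), c + radius + 1)
        w[lo:hi] = boost
    return w

def weighted_bce(probs, targets, weights, eps=1e-9):
    """Binary cross-entropy averaged with per-frame weights."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    t = np.asarray(targets, dtype=float)
    per_frame = -(t * np.log(probs) + (1 - t) * np.log(1 - probs))
    return float((weights * per_frame).sum() / weights.sum())
```

With this weighting, errors in transition regions contribute more to the training objective than errors in stable speech regions, which is one way to push a model toward more precise change-point localization.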
Problem

Research questions and friction points this paper is trying to address.

- speaker diarization
- Conformer
- Mamba
- long-range dependency
- local feature modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

- ConBiMamba
- ExtBiMamba
- Boundary-Enhanced Transition Loss
- Layer-wise Feature Aggregation
- speaker diarization
Zhen Liao
School of Electronic Information and Communications, Hubei Provincial Key Laboratory of Smart Internet Technology, Huazhong University of Science and Technology, China
Gaole Dai
PhD Candidate, Peking University
Mengqiao Chen
School of Electronic Information and Communications, Hubei Provincial Key Laboratory of Smart Internet Technology, Huazhong University of Science and Technology, China
Wenqing Cheng
School of Electronic Information and Communications, Hubei Provincial Key Laboratory of Smart Internet Technology, Huazhong University of Science and Technology, China
Wei Xu
University of Science and Technology of China