Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge

πŸ“… 2025-06-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address insufficient cross-modal interaction modeling and inadequate exploitation of inter-frame temporal context in audio-visual speaker diarization (AVSD), this paper proposes CASA-Net, an end-to-end framework for MISP 2025 Task 1. CASA-Net introduces a co-modeling mechanism that jointly integrates cross-modal attention (CA) and intra-modal self-attention (SA). It further incorporates pseudo-label iterative refinement during training, and median filtering with overlapping-frame averaging as post-processing, to improve the robustness of temporal predictions. On the official evaluation set, CASA-Net achieves a diarization error rate (DER) of 8.18%, a 47.3% relative reduction over the 15.52% baseline, demonstrating markedly improved audio-visual synergy and speaker discrimination in multi-speaker scenarios.

πŸ“ Abstract
This paper presents the system developed for Task 1 of the Multi-modal Information-based Speech Processing (MISP) 2025 Challenge. We introduce CASA-Net, an embedding fusion method designed for end-to-end audio-visual speaker diarization (AVSD) systems. CASA-Net incorporates a cross-attention (CA) module to effectively capture cross-modal interactions in audio-visual signals and employs a self-attention (SA) module to learn contextual relationships among audio-visual frames. To further enhance performance, we adopt a training strategy that integrates pseudo-label refinement and retraining, improving the accuracy of timestamp predictions. Additionally, median filtering and overlap averaging are applied as post-processing techniques to eliminate outliers and smooth prediction labels. Our system achieved a diarization error rate (DER) of 8.18% on the evaluation set, representing a relative improvement of 47.3% over the baseline DER of 15.52%.
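The abstract describes two attention stages: a cross-attention (CA) module in which each modality attends to the other, followed by a self-attention (SA) module that models temporal context across audio-visual frames. A minimal numpy sketch of this fusion pattern is shown below; the function names, embedding sizes, and the concatenation step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over frame sequences."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def casa_fuse(audio, video):
    """Hypothetical CASA-Net-style fusion sketch:
    cross-attention lets each modality query the other,
    then self-attention captures temporal context over the
    concatenated audio-visual frame embeddings."""
    a2v = attention(audio, video, video)   # audio queries video (CA)
    v2a = attention(video, audio, audio)   # video queries audio (CA)
    fused = np.concatenate([a2v, v2a], axis=-1)
    return attention(fused, fused, fused)  # temporal context (SA)

# toy example: 10 frames, 16-dim embeddings per modality
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 16))
video = rng.standard_normal((10, 16))
out = casa_fuse(audio, video)
print(out.shape)  # (10, 32)
```

In a real system the queries, keys, and values would be learned linear projections and the output would feed a per-speaker activity classifier; the sketch only shows the data flow between the CA and SA stages.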
Problem

Research questions and friction points this paper is trying to address.

Develops CASA-Net for audio-visual speaker diarization
Enhances cross-modal interaction via attention mechanisms
Improves timestamp accuracy with pseudo-label refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention module captures audio-visual interactions
Self-attention module learns contextual frame relationships
Pseudo-label refinement enhances timestamp prediction accuracy
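The post-processing steps named in the abstract, median filtering to remove outlier frames and overlap averaging to smooth predictions across chunk boundaries, can be sketched as follows; the window size, hop, and function names are illustrative assumptions rather than the paper's actual settings.

```python
import numpy as np

def median_filter(probs, k=5):
    """Suppress isolated outlier frames with a sliding median (k odd)."""
    pad = k // 2
    padded = np.pad(probs, pad, mode="edge")
    return np.array([np.median(padded[i:i + k]) for i in range(len(probs))])

def overlap_average(chunks, hop, length):
    """Average per-frame predictions from overlapping equal-size windows."""
    total = np.zeros(length)
    count = np.zeros(length)
    for i, chunk in enumerate(chunks):
        start = i * hop
        total[start:start + len(chunk)] += chunk
        count[start:start + len(chunk)] += 1
    return total / np.maximum(count, 1)

# toy example: a single-frame spike is removed by the median filter
smoothed = median_filter(np.array([0.0, 0.0, 1.0, 0.0, 0.0]), k=3)
print(smoothed)  # [0. 0. 0. 0. 0.]

# two 4-frame windows with hop 2 are averaged in their overlap region
avg = overlap_average([np.ones(4), 3 * np.ones(4)], hop=2, length=6)
print(avg)  # [1. 1. 2. 2. 3. 3.]
```

Both operations act on per-frame speaker-activity probabilities before thresholding them into diarization timestamps.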
Zhaoyang Li
Ph.D. student, University of Science and Technology of China
Computer vision
Haodong Zhou
School of Electronic Science and Engineering, Xiamen University, China
Longjie Luo
Xiamen University
Speech signal processing
Xiaoxiao Li
School of Electronic Information, Beijing Jiaotong University, China
Yongxin Chen
Georgia Institute of Technology
Control theory, machine learning, robotics, optimal transport, optimization
Lin Li
School of Informatics, Xiamen University, China
Q. Hong
School of Informatics, Xiamen University, China