🤖 AI Summary
This work addresses the limitations of current end-to-end large audio-language models in speaker diarization and recognition, which struggle to distinguish between similar speakers due to the scarcity of large-scale real conversational data and the absence of explicit speaker representation optimization. To overcome this, the authors propose a global-local hierarchical speaker classification mechanism that generates global speaker labels via clustering and refines them into local, fine-grained labels through intra-cluster re-encoding. This approach jointly optimizes speaker classification, diarization, and speech recognition without relying on extensive real dialogue data. Experiments on the AliMeeting, AISHELL-4, and AMI-SDM datasets demonstrate that the method significantly enhances discrimination among similar speakers while preserving transcription accuracy, achieving performance comparable to or better than existing approaches based on simulated data or multiple encoders.
📝 Abstract
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder approaches, without relying on large-scale real conversational data.
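To make the global-local labeling idea concrete, the sketch below illustrates one plausible reading of it: speaker embeddings are first clustered to obtain coarse global labels, and speakers inside each cluster are then re-indexed to obtain fine-grained local labels. This is a minimal illustration, not the paper's implementation; the k-means clustering, the embedding dimensionality, and the function name `global_local_labels` are all assumptions made for the example.

```python
import numpy as np

def global_local_labels(embeddings, n_clusters, n_iters=20, seed=0):
    """Hypothetical sketch of a global-local label hierarchy.

    Global label  = cluster index from a simple k-means over speaker embeddings.
    Local label   = position of each speaker within its own cluster
                    (a stand-in for the paper's intra-cluster re-encoding).
    """
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling distinct embeddings.
    centers = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each embedding to its nearest center (Euclidean distance).
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        global_labels = dists.argmin(axis=1)
        # Update centers as the mean of their assigned embeddings.
        for k in range(n_clusters):
            members = embeddings[global_labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    # Re-encode speakers within each cluster into compact local indices.
    local_labels = np.zeros(len(embeddings), dtype=int)
    for k in range(n_clusters):
        idx = np.flatnonzero(global_labels == k)
        local_labels[idx] = np.arange(len(idx))
    return global_labels, local_labels
```

In this reading, a joint training objective could classify both label levels, so similar speakers sharing a global cluster are still separated by their local labels.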