Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of speech recognition, speaker diarization, and timestamp localization in multi-speaker conversations—particularly those involving overlapping speech, rapid turn-taking, and limited contextual cues—by proposing an end-to-end spoken large language model. The model incorporates an agent-based multi-turn temporal reasoning mechanism and a speaker-aware cache, enabling joint prediction of speaker identity, gender, temporal boundaries, and transcribed text. Rather than relying on single-pass inference, it iteratively analyzes the global structure of the audio before performing fine-grained segment analysis, and is trained with a three-stage progressive strategy. Evaluated on the AliMeeting and AISHELL-4 datasets, the proposed approach significantly outperforms strong baselines, demonstrating consistent performance gains especially in complex interactive settings with substantial speech overlap.
📝 Abstract
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.
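The abstract describes a two-level loop: a global pass proposes temporal boundaries, then each segment is analyzed in turn while a speaker-aware cache carries speaker state beyond the model's context window. The sketch below illustrates that control flow only; every function, field, and the dummy attribution logic are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of the agentic multi-turn reasoning loop described in
# the abstract. All names and heuristics here are illustrative stand-ins,
# not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class SpeakerCache:
    """Keeps one compact summary per speaker across segments, so later
    segments can reference speakers seen outside the current window."""
    store: dict = field(default_factory=dict)

    def update(self, speaker_id, embedding):
        self.store[speaker_id] = embedding  # keep the latest summary

    def context(self):
        return dict(self.store)

def propose_boundaries(audio_len_s, max_segment_s=30.0):
    """Stand-in for the model's global-structure pass: uniform cuts."""
    t, cuts = 0.0, []
    while t < audio_len_s:
        end = min(t + max_segment_s, audio_len_s)
        cuts.append((t, end))
        t = end
    return cuts

def analyze_segment(start, end, cache):
    """Stand-in for the fine-grained pass: jointly emits speaker identity,
    timestamps, and transcript for one segment, updating the cache."""
    speaker = f"spk{int(start // 30) % 2}"    # dummy speaker attribution
    cache.update(speaker, (start + end) / 2)  # dummy speaker embedding
    return {"speaker": speaker, "start": start, "end": end,
            "text": "<transcript>",
            "known_speakers": len(cache.context())}

def transcribe(audio_len_s):
    """Global pass first, then per-segment analysis sharing one cache."""
    cache = SpeakerCache()
    return [analyze_segment(s, e, cache)
            for s, e in propose_boundaries(audio_len_s)]
```

The point of the cache in this sketch is that segment N's prediction can condition on speakers discovered in segments 1..N-1, which is how the paper's speaker-aware cache plausibly extends processing past the training context window.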
Problem

Research questions and friction points this paper is trying to address.

multi-speaker ASR
speaker attribution
timestamp localization
overlapping speech
turn-taking
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn temporal reasoning
speaker-attributed ASR
agentic speech LLM
overlapping speech handling
speaker-aware cache
Zhennan Lin
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University
Shuai Wang
Nanjing University
Zhaokai Sun
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University
Pengyuan Xie
Shanghai Lingguang Zhaxian Technology
Chuan Xie
Shanghai Lingguang Zhaxian Technology
Jie Liu
Shanghai Lingguang Zhaxian Technology
Qiang Zhang
Shanghai Lingguang Zhaxian Technology
Lei Xie
Northwestern Polytechnical University
speech processing · speech recognition · speech synthesis · multimedia · artificial intelligence