AI Summary
This work addresses the fine-grained alignment challenge of "who spoke what and when" in multi-speaker speech recognition and speaker diarization by proposing a unified end-to-end framework based on large language models (LLMs). The approach decouples semantic content from speaker identity and introduces an interleaved temporal anchor mechanism to explicitly model speaker-turn dynamics and generate precise timestamps. By combining serialized output training with a lightweight projector, the method achieves parameter-efficient learning while keeping the LLM backbone frozen. Evaluated on the AMI and AliMeeting datasets, the framework significantly outperforms strong baselines such as Qwen-Omni and Gemini, notably reducing the diarization error rate (DER) in overlapping-speech scenarios while keeping computational overhead low.
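The DER metric reported above can be sketched in a simple frame-based form. The segmentation below is illustrative toy data, not from the paper, and we assume the hypothesis speaker labels have already been optimally mapped onto the reference labels:

```python
def frame_der(ref, hyp):
    """Frame-level Diarization Error Rate over per-frame speaker sets.

    ref, hyp: lists (one entry per frame) of speaker-label sets;
    an empty set means silence. Using sets lets overlapped speech
    (several active speakers per frame) be scored naturally.
    """
    missed = false_alarm = confusion = speech = 0
    for r, h in zip(ref, hyp):
        speech += len(r)                         # total reference speaker-frames
        correct = len(r & h)                     # correctly attributed speakers
        missed += max(len(r) - len(h), 0)        # reference speakers with no hypothesis slot
        false_alarm += max(len(h) - len(r), 0)   # hypothesis speakers with no reference slot
        confusion += min(len(r), len(h)) - correct  # filled slots with the wrong speaker
    der = (missed + false_alarm + confusion) / max(speech, 1)
    return der, missed, false_alarm, confusion


# Toy example: frame 3 is overlapped speech, frame 4 is a false alarm.
ref = [{"A"}, {"A"}, {"A", "B"}, set()]
hyp = [{"A"}, {"B"}, {"A"}, {"C"}]
der, m, fa, c = frame_der(ref, hyp)  # der = 0.75, one error of each type
```

Overlap is exactly where the missed-speech term grows for systems that emit a single speaker per frame, which is why overlapping-speech DER is the stress test emphasized here.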
Abstract
We present TagSpeech, a unified LLM-based framework that uses Temporal Anchor Grounding for joint multi-speaker ASR and speaker diarization. The framework rests on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time-anchor mechanism that supports fine-grained timestamp prediction and also acts as a synchronization signal between semantic understanding and speaker tracking. Unlike prior work that focuses primarily on speaker-attributed ASR or implicit diarization, TagSpeech addresses fine-grained speaker-content alignment and explicitly models "who spoke what and when" in an end-to-end manner. Experiments on the AMI and AliMeeting benchmarks show that our method achieves consistent improvements in Diarization Error Rate (DER) over strong end-to-end baselines, including Qwen-Omni and Gemini, particularly on complex speech overlaps. Moreover, TagSpeech adopts a parameter-efficient training paradigm in which the LLM backbone is frozen and only lightweight projectors are trained, yielding strong performance at low computational cost.
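A serialized training target with interleaved time anchors might be built as follows. The token format (`<t:...>`, `<spk:...>`, `<sc>`) is our illustrative assumption, not the paper's exact vocabulary; only the idea of first-in-first-out speaker-tagged utterances, as in SOT, comes from the text above:

```python
def serialize(utterances):
    """Serialize utterances into a single SOT-style target sequence.

    Each utterance is wrapped in time anchors (<t:seconds>) and prefixed
    with a speaker tag (<spk:k>); utterances are ordered by start time
    and joined with a speaker-change token <sc>. The anchors give the
    decoder an explicit synchronization point between the semantic
    stream (the words) and the speaker stream (the tags).
    """
    parts = []
    for u in sorted(utterances, key=lambda u: u["start"]):
        parts.append(
            f"<t:{u['start']:.1f}><spk:{u['spk']}> {u['text']} <t:{u['end']:.1f}>"
        )
    return " <sc> ".join(parts)


# Overlapping toy utterances: speaker 2 starts before speaker 1 finishes.
utts = [
    {"spk": 2, "start": 0.8, "end": 2.0, "text": "hi there"},
    {"spk": 1, "start": 0.0, "end": 1.2, "text": "hello"},
]
target = serialize(utts)
# "<t:0.0><spk:1> hello <t:1.2> <sc> <t:0.8><spk:2> hi there <t:2.0>"
```

Note that the linearized sequence still encodes the overlap: the second utterance's start anchor (0.8 s) precedes the first utterance's end anchor (1.2 s), so timestamps survive serialization even when turns interleave.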