MOSS Transcribe Diarize Technical Report

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speaker diarization and timestamped transcription systems struggle to achieve end-to-end modeling due to limited context windows, weak long-range speaker memory, and insufficient timestamp precision. This work proposes the first unified multimodal large language model that enables end-to-end joint modeling of speaker attribution and timestamped transcription. The approach supports audio inputs up to 90 minutes in duration with a 128k-token context window, integrating explicit speaker memory mechanisms with robust long-range dependency modeling, and is trained on large-scale real-world data. Evaluated across multiple public and internal benchmarks, the system substantially outperforms current state-of-the-art commercial solutions, achieving significant advances in both transcription accuracy and speaker discrimination capability.

📝 Abstract
Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real-world, in-the-wild data and equipped with a 128k context window for inputs of up to 90 minutes, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
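To make the task concrete: an SATS system pairs each transcribed utterance with a speaker label and start/end timestamps. The sketch below shows one plausible segment structure and rendering; the class, field names, and output format are illustrative assumptions, not the paper's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class SATSSegment:
    """One speaker-attributed, time-stamped transcript segment (illustrative schema)."""
    speaker: str   # speaker label, e.g. "S1"
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # transcribed words for this segment

def render(segments):
    """Format segments as a readable meeting transcript."""
    return "\n".join(
        f"[{s.start:7.2f}-{s.end:7.2f}] {s.speaker}: {s.text}" for s in segments
    )

segments = [
    SATSSegment("S1", 0.00, 3.52, "Let's review the quarterly results."),
    SATSSegment("S2", 3.80, 6.10, "Revenue is up twelve percent."),
]
print(render(segments))
```

A 90-minute meeting would simply be a longer list of such segments, which is why the long context window and persistent speaker memory described above matter.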
Problem

Research questions and friction points this paper is trying to address.

Speaker-Attributed Transcription
Time-Stamped Transcription
Speaker Diarization
End-to-End Modeling
Meeting Transcription
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-End SATS
Multimodal Large Language Model
Speaker Diarization
Long-Context Transcription
Time-Stamped Transcription
Authors
Donghua Yu, Zheng-Yu Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei (Fudan University, Natural Language Processing), Qinyuan Cheng, Shimin Li (Fudan University, Large Language Model / Speech Language Model), Xipeng Qiu