🤖 AI Summary
This work proposes the first single-pass, end-to-end framework for long-form audio understanding, capable of processing up to 60 minutes of audio in a unified manner. Addressing the challenges posed by fragmented context and overlapping speakers in scenarios such as meetings and podcasts, the approach jointly integrates automatic speech recognition, speaker diarization, and timestamp generation. It employs a prompt-based context injection mechanism that improves the accuracy of domain-specific terminology and homophone disambiguation without requiring explicit language identifiers. Built upon the VibeVoice architecture, the model leverages multi-task joint modeling and supports multilingual and code-switching inputs, significantly outperforming existing systems in complex, long-duration settings with high-fidelity transcription and precise speaker attribution.
📝 Abstract
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice. It is designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts), which remain despite recent advances in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio, unifying Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.