🤖 AI Summary
Current speech models perform strongly on short utterances but show limited robustness and integrative reasoning in long-form scenarios such as meeting transcription and spoken document understanding. To bridge this gap, this work introduces LongSpeech, the first large-scale, extensible multitask benchmark for long speech, comprising over 100,000 audio segments averaging ten minutes each. LongSpeech supports diverse tasks, including automatic speech recognition, speech translation, summarization, language identification, speaker counting, content disentanglement, and question answering. The benchmark is built from heterogeneous data sources and features multidimensional manual and automatic annotations, standardized evaluation protocols, and a reproducible construction pipeline, offering a unified platform for long-speech research. Preliminary evaluations reveal substantial performance gaps in state-of-the-art models, particularly in cross-task generalization and higher-order reasoning.
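For concreteness, a minimal sketch of what one annotated segment might look like given the task list above. The class and field names are hypothetical illustrations of the described annotation dimensions, not the benchmark's actual release format:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LongSpeechSegment:
    """One ~10-minute segment with multitask annotations (hypothetical schema)."""
    audio_path: str                      # pointer to the long-form audio file
    duration_sec: float                  # roughly 600 s on average
    transcript: str                      # ASR reference text
    translation: Optional[str] = None    # speech-translation reference
    summary: Optional[str] = None        # reference summary
    languages: list[str] = field(default_factory=list)   # language-ID labels
    num_speakers: Optional[int] = None   # speaker-counting label
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)
```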
📝 Abstract
Recent audio-language models have achieved remarkable success on short, segment-level speech tasks. However, real-world applications such as meeting transcription, spoken document understanding, and conversational analysis require models that can process and reason robustly over long-form audio. In this work, we present LongSpeech, a large-scale and extensible benchmark specifically designed to evaluate and advance the capabilities of speech models on long-duration audio. LongSpeech comprises over 100,000 speech segments, each approximately 10 minutes long, with rich annotations for ASR, speech translation, summarization, language identification, speaker counting, content disentanglement, and question answering. We introduce a reproducible pipeline for constructing long-form speech benchmarks from diverse sources, enabling future extensions. Our initial experiments with state-of-the-art models reveal significant performance gaps: models often specialize in one task at the expense of others and struggle with higher-level reasoning. These findings underscore the challenging nature of our benchmark, which will be made publicly available to the research community.
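The abstract does not spell out the standardized evaluation protocols. As a sketch, a per-task scoring harness could pair each task with its conventional metric; the metric choices below (WER for ASR, BLEU for translation, ROUGE-L for summarization) and the use of the jiwer, sacrebleu, and rouge-score libraries are assumptions, not the benchmark's official protocol:

```python
# Hypothetical per-task scoring harness; metric choices are conventional
# assumptions, not LongSpeech's official evaluation protocol.
import jiwer                           # pip install jiwer
import sacrebleu                       # pip install sacrebleu
from rouge_score import rouge_scorer   # pip install rouge-score

def score_asr(refs: list[str], hyps: list[str]) -> float:
    """Corpus word error rate (lower is better)."""
    return jiwer.wer(refs, hyps)

def score_translation(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level BLEU (higher is better)."""
    return sacrebleu.corpus_bleu(hyps, [refs]).score

def score_summarization(refs: list[str], hyps: list[str]) -> float:
    """Mean ROUGE-L F1 across segments (higher is better)."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [scorer.score(r, h)["rougeL"].fmeasure for r, h in zip(refs, hyps)]
    return sum(scores) / len(scores)
```

Keeping each task behind a uniform `(refs, hyps) -> float` interface would make it straightforward to report a single per-task leaderboard number while swapping metrics as the benchmark evolves.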