BLAB: Brutally Long Audio Bench

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio language models (ALMs) lack rigorous evaluation on long-duration speech understanding, particularly in realistic conversational settings. Method: We introduce BLAB, the first long-audio benchmark built on authentic dialogues, comprising 833+ hours of real-world conversations (average duration: 51 minutes) and covering four task categories: temporal localization, duration estimation, emotion recognition, and counting. Each clip is paired with human-annotated natural-language questions and answers; the audio was collected from permissively licensed sources and passed through a human-assisted filtering process to ensure task compliance. We systematically assess six state-of-the-art ALMs, including GPT-4o and Gemini 2.0 Pro. Results: All models degrade severely on long audio (multi-task accuracy below 30%), revealing fundamental weaknesses in temporal reasoning, localization, counting, and the use of non-phonemic information, with models relying more on text prompts than on the audio itself. BLAB establishes the first multidimensional, natural-language-driven evaluation paradigm for long audio, serving as both a challenging benchmark and a diagnostic toolkit for moving beyond short-audio constraints.

📝 Abstract
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating audio LMs on long-form conversational speech segments
Assessing performance on localization, duration, emotion, and counting tasks
Addressing challenges in understanding non-phonemic information in long audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces BLAB, a long-form audio benchmark of 833+ hours of full-length clips (averaging 51 minutes)
Evaluates six open-source and proprietary audio LMs on localization, duration estimation, emotion, and counting tasks
Pairs each clip with human-annotated, text-based natural language questions and answers
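To make the benchmark's structure concrete, here is a minimal sketch of what a BLAB-style evaluation item and a simple exact-match scorer might look like. The field names and scoring rule are illustrative assumptions, not the benchmark's actual schema or official metric:

```python
from dataclasses import dataclass

@dataclass
class BlabItem:
    """Hypothetical BLAB-style item: a full-length clip paired with a
    human-annotated natural-language question and gold answer."""
    audio_path: str   # path to a full-length clip (BLAB clips average ~51 min)
    task: str         # "localization" | "duration" | "emotion" | "counting"
    question: str     # human-annotated natural-language question
    answer: str       # gold answer

def score(predictions: list[str], items: list[BlabItem]) -> float:
    """Exact-match accuracy over a set of items (simplified scoring)."""
    if not items:
        return 0.0
    correct = sum(
        pred.strip().lower() == item.answer.strip().lower()
        for pred, item in zip(predictions, items)
    )
    return correct / len(items)

# Toy usage: two items with model predictions, one correct.
items = [
    BlabItem("clip_001.wav", "counting", "How many speakers take part?", "3"),
    BlabItem("clip_002.wav", "emotion", "What is the caller's dominant emotion?", "frustration"),
]
preds = ["3", "anger"]
print(score(preds, items))  # 0.5
```

In practice, tasks such as temporal localization would need task-specific metrics (e.g. tolerance windows around timestamps) rather than exact string match; this sketch only illustrates the question-answer pairing described in the paper.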