VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

📅 2026-05-27
🤖 AI Summary
Current audio language models lack effective evaluation on hour-long real-world audio such as podcasts and lectures, because prevailing benchmarks rely on short clips or artificially concatenated segments that fail to capture genuine long-context comprehension. This work proposes VoiceGiraffe, a comprehensive benchmark tailored for real-world, hour-scale audio, featuring 1,500 curated triplets that span single-hop perception and multi-hop reasoning tasks. The study evaluates open-source and proprietary models under three inference paradigms: end-to-end inference, cascaded caption aggregation, and reasoning-enhanced cascading with an external large language model. Findings reveal a pronounced bottleneck in long-range memory persistence: models are better at connecting salient causal cues than at tracking sparse events over extended durations, whereas humans show the opposite pattern. Moreover, no single inference paradigm dominates; each helps a different class of model, and the benchmark remains far from saturation.
📝 Abstract
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess the capacity of LALMs for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1,500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. The results support three findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, no single inference paradigm universally dominates: end-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, long-range memory persistence is a key bottleneck: LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.
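To make the cascaded caption aggregation paradigm concrete, here is a minimal Python sketch of one plausible reading of it: split the long recording into fixed windows, caption each window, join the timestamped captions, and pass them with the question to a text-only model. This is not the authors' pipeline. The names `split_into_windows`, `cascaded_answer`, `caption_fn`, and `answer_fn`, as well as the 60-second default window, are assumptions introduced for illustration only.

```python
from typing import Callable, List, Tuple


def split_into_windows(duration_s: float, window_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) times covering [0, duration_s] in fixed-size windows."""
    duration_s = float(duration_s)
    window_s = float(window_s)
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)  # last window may be shorter
        windows.append((start, end))
        start = end
    return windows


def cascaded_answer(
    duration_s: float,
    question: str,
    caption_fn: Callable[[float, float], str],  # hypothetical: captions one audio window
    answer_fn: Callable[[str, str], str],       # hypothetical: text LLM over (context, question)
    window_s: float = 60.0,                     # assumed window length, not from the paper
) -> str:
    """Caption each window, aggregate the captions, then answer from text alone."""
    captions = []
    for start, end in split_into_windows(duration_s, window_s):
        caption = caption_fn(start, end)
        captions.append(f"[{start:.0f}s-{end:.0f}s] {caption}")
    context = "\n".join(captions)
    return answer_fn(context, question)
```

In this sketch the answering model never hears the audio, so anything the captioner omits is lost. That is one way to read the reported trade-off: a cascade can stabilize small models on hour-scale input while limiting stronger systems that would otherwise use the raw audio.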
Problem

Research questions and friction points this paper is trying to address.

long-context audio understanding
audio-language models
long-range information comprehension
hour-level audio
real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-context audio understanding
audio-language models
benchmark
memory persistence
multi-hop reasoning