AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs

📅 2025-10-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large Audio-Language Models (LALMs) face two challenges in long-audio processing: quadratic attention complexity (O(N²)) and weak modeling of long-range temporal dependencies; existing benchmarks predominantly target short clips and cannot evaluate long-context understanding. This work introduces AudioMarathon, the first comprehensive benchmark for long-duration audio understanding, spanning the speech, sound, and music domains and supporting inputs of 2,250–7,500 audio tokens. It establishes three evaluation dimensions: long-sequence comprehension, cross-domain generalization, and multi-hop reasoning. Using acceleration techniques, including token pruning and KV cache eviction, the authors conduct a joint performance-efficiency evaluation of state-of-the-art LALMs. The results expose critical bottlenecks: sharp accuracy degradation and severe memory overhead under long-sequence inference. The benchmark provides a reproducible evaluation framework and concrete architectural guidance for developing efficient, scalable audio foundation models.

📝 Abstract
Processing long-form audio is a major challenge for Large Audio Language Models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do not evaluate models in realistic long-context settings. To address this gap, we introduce AudioMarathon, a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. AudioMarathon provides a diverse set of tasks built upon three pillars: (i) long-context audio inputs with durations ranging from 90.0 to 300.0 seconds, corresponding to encoded sequences of 2,250 to 7,500 audio tokens; (ii) full domain coverage across speech, sound, and music; and (iii) complex reasoning that requires multi-hop inference. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. We also study acceleration techniques and analyze the trade-offs of token pruning and KV cache eviction. The results show large gaps across current LALMs and highlight the need for better temporal reasoning and memory-efficient architectures. We believe AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
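As a quick consistency check (an illustrative sketch, not code from the paper), the durations and token counts quoted in the abstract, 2,250 tokens for 90 s and 7,500 tokens for 300 s, imply a fixed encoding rate of 25 audio tokens per second:

```python
def audio_tokens(duration_s: float, tokens_per_second: float = 25.0) -> int:
    """Estimate the encoded audio token count at a fixed tokenizer rate.

    The rate of 25 tokens/s is derived from the abstract's two endpoints:
    2,250 / 90 s = 7,500 / 300 s = 25 tokens per second.
    """
    return round(duration_s * tokens_per_second)

# Both endpoints quoted in the abstract are reproduced exactly.
assert audio_tokens(90.0) == 2250
assert audio_tokens(300.0) == 7500
```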
Problem

Research questions and friction points this paper is trying to address.

Evaluating long-context audio understanding and efficiency in LALMs
Addressing quadratic attention cost and long-range dependencies in audio
Benchmarking models on multi-hop reasoning across diverse audio domains
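To make the quadratic-cost friction point concrete, a minimal back-of-the-envelope sketch (illustrative, using only the token counts given in the abstract): full self-attention materializes an N x N score matrix, so moving from the benchmark's shortest input (2,250 tokens) to its longest (7,500 tokens) multiplies that cost by roughly 11x, while the input itself only grows 3.3x.

```python
def attention_score_entries(n_tokens: int) -> int:
    # Full self-attention builds an N x N score matrix per head,
    # so both compute and activation memory scale as O(N^2).
    return n_tokens * n_tokens

short_n, long_n = 2250, 7500  # AudioMarathon's min/max audio token counts
length_ratio = long_n / short_n
cost_ratio = attention_score_entries(long_n) / attention_score_entries(short_n)
print(f"input grows {length_ratio:.1f}x, attention cost grows {cost_ratio:.1f}x")
```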
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AudioMarathon benchmark for long-context audio evaluation
Evaluates models on multi-hop reasoning across speech, sound, music
Analyzes token pruning and KV cache eviction acceleration techniques
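The acceleration techniques named above can be sketched in miniature. The snippet below is a generic score-based KV cache eviction policy (keep the cache positions with the highest accumulated attention mass, drop the rest), assuming per-position attention statistics are available; it illustrates the idea only and is not the specific algorithm evaluated in the paper.

```python
import numpy as np

def evict_kv(keys: np.ndarray, values: np.ndarray,
             attn_mass: np.ndarray, keep: int):
    """Keep the `keep` cache positions with the highest accumulated
    attention mass; evict the rest (illustrative policy only)."""
    keep_idx = np.argsort(attn_mass)[-keep:]  # top-`keep` positions
    keep_idx.sort()                           # preserve temporal order
    return keys[keep_idx], values[keep_idx]

# Toy example: 8 cached positions, head dim 4, keep the top 3.
rng = np.random.default_rng(0)
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
mass = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.2, 0.3, 0.8])
k_kept, v_kept = evict_kv(k, v, mass, keep=3)
print(k_kept.shape)  # (3, 4): positions 0, 4, and 7 survive
```

Token pruning follows the same pattern, applied to the encoded audio tokens before (or inside) the decoder rather than to the KV cache; both trade a small accuracy risk for lower memory and latency on long inputs.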
Peize He
Shanghai Jiao Tong University
Zichen Wen
Shanghai Jiao Tong University
Efficient AI, Trustworthy AI, Large Language Model, Machine Learning
Yubo Wang
Shanghai Jiao Tong University
Yuxuan Wang
Shanghai Jiao Tong University
Xiaoqian Liu
Shanghai Jiao Tong University, Northeastern University
Jiajie Huang
Shanghai Jiao Tong University
Zehui Lei
Shanghai Jiao Tong University
Zhuangcheng Gu
Carnegie Mellon University
Xiangqi Jin
University of Electronic Science and Technology of China
Jiabing Yang
University of Chinese Academy of Sciences
Kai Li
Tsinghua University
Zhifei Liu
Shanghai Jiao Tong University
Weijia Li
Sun Yat-sen University, Shanghai AI Laboratory
Cunxiang Wang
Tsinghua University; ZhipuAI
Large Language Models, LLM Evaluation, LLM Post-training
Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence
Linfeng Zhang
DP Technology; AI for Science Institute
AI for Science, multi-scale modeling, molecular simulation, drug/materials design