EvA: An Evidence-First Audio Understanding Paradigm for LALMs

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “evidence bottleneck” in existing Large Audio Language Models, which struggle to retain task-relevant acoustic evidence when reasoning over complex acoustic scenes. To overcome this limitation, the authors propose an “evidence-first” paradigm and introduce EvA, a dual-path architecture that aggregates multi-scale intermediate features from CED-Base and fuses them with Whisper features in a non-compressive, temporally aligned manner, preserving critical acoustic cues without shortening the sequence. The study also releases EvA-Perception, a large-scale open-source dataset of event-ordered captions and question-answer pairs. Experimental results show that EvA achieves state-of-the-art open-source performance on the MMAU, MMAR, and MMSU benchmarks and significantly outperforms Kimi-Audio-7B, with the largest gains on perception-intensive tasks.
📝 Abstract
Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
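
To make the fusion step concrete, below is a minimal PyTorch sketch of a non-compressive, time-aligned dual-path fusion in the spirit of EvA. The choice of CED layers, the softmax-weighted layer aggregation, the feature dimensions, and the linear time interpolation are illustrative assumptions, not the paper's exact design; the property the sketch reproduces is the one the abstract states: aggregated CED features are aligned to the Whisper timeline and added to the Whisper stream without changing sequence length.

```python
# Minimal sketch of EvA-style non-compressive, time-aligned fusion.
# Assumed shapes: Whisper features (B, T_w, D_w); a list of intermediate
# CED layer features, each (B, T_c, D_c). Layer count, dimensions, and
# the learned-weight aggregation are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathFusion(nn.Module):
    def __init__(self, ced_dim: int = 768, whisper_dim: int = 1280, num_ced_layers: int = 4):
        super().__init__()
        # Learned scalar weights for aggregating multi-scale CED layers (assumption).
        self.layer_weights = nn.Parameter(torch.zeros(num_ced_layers))
        # Project the aggregated CED features into the Whisper feature space.
        self.proj = nn.Linear(ced_dim, whisper_dim)

    def forward(self, whisper_feats: torch.Tensor, ced_layer_feats: list) -> torch.Tensor:
        # 1) Aggregate intermediate CED layers with a softmax-weighted sum,
        #    preserving multi-scale acoustic cues.
        w = F.softmax(self.layer_weights, dim=0)
        ced = torch.stack(ced_layer_feats, dim=0)           # (L, B, T_c, D_c)
        ced = (w.view(-1, 1, 1, 1) * ced).sum(dim=0)        # (B, T_c, D_c)
        # 2) Align the aggregated CED stream to the Whisper timeline by
        #    interpolating along time; no pooling, so nothing is compressed away.
        ced = ced.transpose(1, 2)                           # (B, D_c, T_c)
        ced = F.interpolate(ced, size=whisper_feats.size(1),
                            mode="linear", align_corners=False)
        ced = ced.transpose(1, 2)                           # (B, T_w, D_c)
        # 3) Project and add; the output keeps Whisper's sequence length.
        return whisper_feats + self.proj(ced)

# Example: fuse 4 CED layers into a Whisper-large-style feature stream.
# (1500 is Whisper's 30 s frame count; the CED frame count is illustrative.)
fusion = DualPathFusion()
whisper_feats = torch.randn(2, 1500, 1280)                  # (B, T_w, D_w)
ced_layers = [torch.randn(2, 992, 768) for _ in range(4)]   # 4 x (B, T_c, D_c)
fused = fusion(whisper_feats, ced_layers)                   # (B, 1500, 1280)
```

Because the fused output keeps Whisper's frame count, any downstream adapter or language-model connector can consume it exactly as it would consume plain Whisper features, which is what lets the fusion stay non-compressive.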
Problem

Research questions and friction points this paper is trying to address.

evidence bottleneck
audio understanding
acoustic evidence
Large Audio Language Models
perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evidence-First
Audio Understanding
Dual-Path Architecture
Acoustic Evidence Preservation
Large Audio Language Models