ALICE: A Multifaceted Evaluation Framework of Large Audio-Language Models' In-Context Learning Ability

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Large Audio-Language Models (LALMs) lack a systematic evaluation of their ability to infer task intent from in-context examples under audio conditioning, particularly regarding the relationship between instruction following and task performance. This work proposes ALICE, a three-stage evaluation framework that progressively reduces textual guidance while integrating cross-modal in-context examples and structured output constraints, and uses it to assess six prominent LALMs across four audio understanding tasks. The study reveals, for the first time, a pronounced asymmetry in LALMs' in-context learning: in-context examples consistently improve adherence to output formatting, yet often fail to enhance, and sometimes even impair, core task performance. This discrepancy exposes fundamental limitations in LALMs' mechanisms for cross-modal semantic integration.

📝 Abstract
While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.
Problem

Research questions and friction points this paper is trying to address.

Large Audio-Language Models
In-Context Learning
Audio Conditioning
Cross-Modal Integration
Task Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Learning
Audio-Language Models
Cross-Modal Integration
Format Compliance
Task Performance