IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio large language models (Audio-LLMs) lack a standardized, fine-grained evaluation of instruction-following capability, particularly structural fidelity after multimodal alignment. Method: We introduce IFEval-Audio, the first structured audio-modal instruction-following benchmark, comprising 280 human-crafted audio-instruction-answer triples covering six dimensions: content accuracy, case sensitivity, symbol usage, list structure, length constraints, and output formatting. We further design an automated, fine-grained evaluation protocol with deterministic validation rules. Contribution/Results: Our work formally defines and quantifies structural adherence failures in Audio-LLM instruction following, addressing a critical gap in audio-specific instruction-following assessment. We conduct comprehensive evaluations of state-of-the-art Audio-LLMs, revealing systematic limitations in structural adherence. To foster community advancement, we publicly release the IFEval-Audio dataset and evaluation framework.
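The summary refers to an automated protocol with deterministic validation rules per dimension, but the paper's checker code is not reproduced here. Below is a minimal sketch, assuming Python and simple rule forms; the function names, signatures, and specific rules are illustrative assumptions, not the released protocol.

```python
import re

# Hypothetical per-dimension validators; rule details are illustrative only.

def check_capitalization(response: str, mode: str) -> bool:
    """Check a case constraint such as 'all uppercase' or 'all lowercase'."""
    if mode == "upper":
        return response == response.upper()
    if mode == "lower":
        return response == response.lower()
    return True

def check_length(response: str, max_words: int) -> bool:
    """Check a length constraint expressed as a maximum word count."""
    return len(response.split()) <= max_words

def check_list_structure(response: str, num_items: int) -> bool:
    """Check that the response is a list with the required number of items."""
    items = [line for line in response.splitlines() if re.match(r"^\s*[-*\d]", line)]
    return len(items) == num_items

def check_symbol(response: str, symbol: str, required: bool = True) -> bool:
    """Check that a required symbol appears (or a forbidden one does not)."""
    return (symbol in response) == required
```

Checks of this kind are deterministic string rules, so scoring needs no judge model for the structural dimensions; content accuracy would still require a separate comparison against the reference answer.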

📝 Abstract
Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess instruction-following ability in audio-based LLMs. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.
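The abstract describes each example as an audio-instruction-answer triple tagged with one of six dimensions. A minimal sketch of how such a record and a per-dimension pass-rate computation might look follows; the field names, record layout, and harness are assumptions, not the released schema.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical record layout for one IFEval-Audio example; field names are assumptions.
@dataclass
class AudioInstructionExample:
    audio_path: str        # audio clip the model must listen to
    instruction: str       # text instruction stating the required output structure
    reference_answer: str  # expected content of the answer
    dimension: str         # one of: Content, Capitalization, Symbol, List Structure, Length, Format

def pass_rate_by_dimension(examples, responses, validators):
    """Aggregate per-dimension pass rates given model responses and rule-based validators."""
    passed, total = defaultdict(int), defaultdict(int)
    for ex, resp in zip(examples, responses):
        total[ex.dimension] += 1
        if validators[ex.dimension](ex, resp):
            passed[ex.dimension] += 1
    return {dim: passed[dim] / total[dim] for dim in total}
```

A harness like this would pair each dimension with a rule-based check (as in the earlier sketch) and report one pass rate per dimension per model, which matches the per-dimension comparison the benchmark aims to support.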
Problem

Research questions and friction points this paper is trying to address.

Assessing instruction-following ability in audio-based LLMs
Evaluating how multimodal alignment affects LLM instruction-following performance
Creating a benchmark for audio-involved instruction compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces IFEval-Audio for audio LLM evaluation
Contains 280 audio-instruction-answer triples
Benchmarks state-of-the-art audio LLMs