IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio large language models (Audio-LLMs) lack a standardized, fine-grained evaluation of instruction-following capability, particularly structural fidelity after multimodal alignment. Method: We introduce IFEval-Audio, the first structured audio-modal instruction-following benchmark, comprising 280 human-crafted audio-instruction-answer triples covering six dimensions: content accuracy, case sensitivity, symbol usage, list structure, length constraints, and output formatting. We further design an automated, fine-grained evaluation protocol with deterministic validation rules. Contribution/Results: Our work formally defines and quantifies structural adherence failures in Audio-LLM instruction following, addressing a critical gap in audio-specific instruction-following assessment. We conduct comprehensive evaluations of state-of-the-art Audio-LLMs, revealing systematic limitations in structural adherence. To foster community advancement, we publicly release the IFEval-Audio dataset and evaluation framework.
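The summary refers to an automated protocol with deterministic validation rules per dimension, but the paper's checker code is not reproduced here. Below is a minimal sketch, assuming Python and simple rule forms; the function names, signatures, and specific rules are illustrative assumptions, not the released protocol.

```python
import re

# Hypothetical per-dimension validators; rule details are illustrative only.

def check_capitalization(response: str, mode: str) -> bool:
    """Check a case constraint such as 'all uppercase' or 'all lowercase'."""
    if mode == "upper":
        return response == response.upper()
    if mode == "lower":
        return response == response.lower()
    return True

def check_length(response: str, max_words: int) -> bool:
    """Check a length constraint expressed as a maximum word count."""
    return len(response.split()) <= max_words

def check_list_structure(response: str, num_items: int) -> bool:
    """Check that the response is a list with the required number of items."""
    items = [line for line in response.splitlines() if re.match(r"^\s*[-*\d]", line)]
    return len(items) == num_items

def check_symbol(response: str, symbol: str, required: bool = True) -> bool:
    """Check that a required symbol appears (or a forbidden one does not)."""
    return (symbol in response) == required
```

Checks of this kind are deterministic string rules, so scoring needs no judge model for the structural dimensions; content accuracy would still require a separate comparison against the reference answer.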

📝 Abstract
Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess instruction-following ability in audio-based LLMs. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.
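The abstract describes each example as an audio-instruction-answer triple tagged with one of six dimensions. A minimal sketch of how such a record and a per-dimension pass-rate computation might look follows; the field names, record layout, and harness are assumptions, not the released schema.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical record layout for one IFEval-Audio example; field names are assumptions.
@dataclass
class AudioInstructionExample:
    audio_path: str        # audio clip the model must listen to
    instruction: str       # text instruction stating the required output structure
    reference_answer: str  # expected content of the answer
    dimension: str         # one of: Content, Capitalization, Symbol, List Structure, Length, Format

def pass_rate_by_dimension(examples, responses, validators):
    """Aggregate per-dimension pass rates given model responses and rule-based validators."""
    passed, total = defaultdict(int), defaultdict(int)
    for ex, resp in zip(examples, responses):
        total[ex.dimension] += 1
        if validators[ex.dimension](ex, resp):
            passed[ex.dimension] += 1
    return {dim: passed[dim] / total[dim] for dim in total}
```

A harness like this would pair each dimension with a rule-based check (as in the earlier sketch) and report one pass rate per dimension per model, which matches the per-dimension comparison the benchmark aims to support.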
Problem

Research questions and friction points this paper is trying to address.

Assessing instruction-following ability in audio-based LLMs
Evaluating how multimodal alignment affects LLM instruction-following performance
Creating a benchmark for audio-involved instruction compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces IFEval-Audio for audio LLM evaluation
Contains 280 audio-instruction-answer triples
Benchmarks state-of-the-art audio LLMs