🤖 AI Summary
Existing inference optimization methods struggle to adapt audio-visual large language models (AVLLMs) for complex multimodal reasoning. To address this, we propose AURELIA, the first framework to introduce test-time inference distillation, enabling efficient transfer of structured, stepwise reasoning capabilities to AVLLMs without additional training or fine-tuning. For systematic evaluation, we introduce AVReasonBench, the first fine-grained audio-visual reasoning benchmark, comprising 4,500 questions across six task categories, including the novel AV-GeoIQ task. By integrating actor-critic reinforcement learning, multimodal prompt engineering, and structured reasoning-path modeling, AURELIA achieves up to a 100% relative performance gain on AVReasonBench. Our evaluation exposes critical reasoning deficiencies across 18 state-of-the-art AVLLMs and demonstrates a practical route to deploying them in real-world multimodal scenarios.
📝 Abstract
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work does not address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multimodal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4,500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multimodal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.