SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited multi-step audio-language reasoning capability of large audio-language models (ALMs). We propose a reinforcement learning framework that integrates structured chain-of-thought (CoT) reasoning with curriculum-guided Group-Relative Policy Optimization (GRPO). Methodologically, we combine explicit structured reasoning with curriculum learning: phased supervised fine-tuning, first on structured CoT and then on unstructured CoT, followed by progressive-difficulty GRPO optimization, trained on a 32k-sample multiple-choice audio reasoning dataset. Experiments show that structured CoT substantially outperforms free-form reasoning: the full framework improves average accuracy by 16.35% over Qwen2-Audio-7B-Instruct, and a Qwen2.5-Omni variant achieves 67.08% on MMAU test-mini, a state-of-the-art result at the time. Our core contributions are: (1) empirical validation that structured reasoning yields significant gains in ALMs' multi-step audio-language reasoning; and (2) the first curriculum-based RL framework specifically designed for joint audio-language reasoning.
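The GRPO stage described above scores groups of sampled responses relative to each other rather than against a learned value function. A minimal sketch of the group-relative advantage computation (not the authors' code; the normalization follows the standard GRPO formulation from DeepSeek-R1):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    reward by the mean and std of its group (all samples drawn for
    the same prompt), replacing a learned critic."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    eps = 1e-8  # guards against zero std when all rewards in the group are equal
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]
```

Responses above the group mean get positive advantages and are reinforced; below-mean responses are penalized, so the policy needs no separate value network.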

📝 Abstract
Recent work shows that reinforcement learning (RL) can markedly sharpen the reasoning ability of large language models (LLMs) by prompting them to "think before answering." Yet whether and how these gains transfer to audio-language reasoning remains largely unexplored. We extend the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to a large audio-language model (LALM) and construct a 32k-sample multiple-choice corpus. Using a two-stage regimen, supervised fine-tuning on structured and unstructured chains-of-thought followed by curriculum-guided GRPO, we systematically compare implicit vs. explicit and structured vs. free-form reasoning under identical architectures. Our structured audio reasoning model, SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning), achieves a 16.35% improvement in average accuracy over the base model Qwen2-Audio-7B-Instruct. Furthermore, the variant built upon Qwen2.5-Omni reaches state-of-the-art performance of 67.08% on the MMAU test-mini benchmark. Ablation experiments on the base model show that: (i) SFT warm-up is important for stable RL training, (ii) structured chains yield more robust generalization than unstructured ones, and (iii) easy-to-hard curricula accelerate convergence and improve final performance. These findings demonstrate that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.
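Structured reasoning of the kind the abstract contrasts with free-form CoT is typically enforced during RL by a rule-based format reward. The paper does not publish its exact tag schema, so the template below is purely illustrative, using the common `<think>…</think><answer>…</answer>` convention from R1-style training on multiple-choice questions:

```python
import re

def format_reward(response):
    """Return 1.0 if the response follows an illustrative structured
    template (reasoning inside <think> tags, then a single A-D choice
    inside <answer> tags), else 0.0. The tag names are assumptions,
    not SARI's published schema."""
    pattern = r"^<think>.+</think>\s*<answer>[A-D]</answer>$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0
```

Combining such a format reward with an answer-correctness reward is a standard way to make "structured" vs. "free-form" an optimizable distinction rather than just a prompting choice.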
Problem

Research questions and friction points this paper is trying to address.

Extends GRPO to improve audio-language reasoning in LALMs
Compares implicit vs explicit reasoning in identical architectures
Enhances audio understanding via structured reasoning and curriculum learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum-guided GRPO for audio reasoning
Two-stage SFT with structured CoT
Easy-to-hard curriculum boosts performance
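The easy-to-hard curriculum above can be sketched as a simple phased data schedule: score each training item by some difficulty proxy, then release the buckets easiest-first across GRPO training phases. The difficulty scores and phase count here are hypothetical, not the paper's actual schedule:

```python
def curriculum_phases(items, difficulties, n_phases=3):
    """Split items into n_phases buckets ordered easiest-first.
    `difficulties` is a parallel list of per-item difficulty scores
    (lower = easier); how those scores are obtained is left open."""
    ranked = [item for _, item in sorted(zip(difficulties, items))]
    size = -(-len(ranked) // n_phases)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Training then runs GRPO on phase 1 alone, adds phase 2, and so on, which is one common way to realize "progressive-difficulty" optimization.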