Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak reasoning capability and poor generalization of models in Audio Question Answering (Audio QA). We propose an error-aware curriculum learning framework integrating difficulty-aware sample ranking, guided selective chain-of-thought (CoT), and GRPO-based reinforcement training. Our key contributions are: (1) dynamically constructing a curriculum sequence prioritizing hard examples based on model prediction errors; (2) adaptively pruning redundant reasoning steps during inference to focus on critical acoustic-semantic alignments; and (3) optimizing CoT generation via GRPO to enhance information utilization efficiency. Evaluated on MMAU-mini and MMAR, our method achieves 73.80% and 64.30% accuracy, respectively, and the MMAR result sets a new state of the art. These results indicate improved robustness and multimodal reasoning capability.

📝 Abstract
We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought for audio question answering. The framework efficiently leverages existing high-quality data through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Integrated with GRPO training, these strategies enable the model to learn more effectively from informative samples. Experiments on MMAU-mini and MMAR demonstrate that Omni-CLST achieves competitive accuracy (73.80% on MMAU-mini) and establishes a new state of the art (64.30% on MMAR), highlighting its robustness and generalization capability in multimodal audio-language understanding.
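The two strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the use of a base model's error count as the difficulty signal, the ordering direction, and the `keep_prob_easy` parameter are all assumptions introduced here for illustration, since the abstract does not specify them.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    answer: str
    thought: str   # reference chain-of-thought annotation
    n_errors: int  # hypothetical: wrong answers from a base model over several trials


def order_by_difficulty(samples, hardest_first=False):
    """Sort samples by the error count used here as a difficulty proxy.

    The ordering direction is left as a parameter because the abstract
    only says samples are organized by difficulty.
    """
    return sorted(samples, key=lambda s: s.n_errors, reverse=hardest_first)


def guided_thought(sample, keep_prob_easy=0.0, rng=random):
    """Keep the reasoning trace for samples the base model got wrong.

    For samples it already answers correctly, keep the trace only with
    probability keep_prob_easy (assumed knob), otherwise drop it.
    """
    if sample.n_errors > 0:
        return sample.thought
    return sample.thought if rng.random() < keep_prob_easy else ""


if __name__ == "__main__":
    data = [
        Sample("What instrument is playing?", "violin", "Bowed timbre ...", 2),
        Sample("Is someone speaking?", "yes", "Clear voiced speech ...", 0),
        Sample("How many speakers?", "two", "Two distinct pitches ...", 1),
    ]
    for s in order_by_difficulty(data):
        print(s.n_errors, repr(guided_thought(s)))
```

In this sketch, the thought text is retained only where the difficulty signal says the model struggles, which is one plausible reading of "focuses reasoning on challenging cases".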
Problem

Research questions and friction points this paper is trying to address.

Improving audio question answering accuracy through curriculum learning
Enhancing reasoning on challenging cases with selective chain-of-thought
Advancing multimodal audio-language understanding with error-aware training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Error-aware curriculum learning that organizes samples by difficulty
Guided selective chain-of-thought reasoning
GRPO training integration for effective learning
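For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization only. The binary correctness reward, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; the summary above does not state the reward design used in Omni-CLST.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards: 1.0 if a sampled answer was correct, else 0.0.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason the selection of informative samples matters for this style of training.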
Jinghua Zhao
Nankai University
Hang Su
MiLM Plus, Xiaomi Inc., China
Lichun Fan
MiLM Plus, Xiaomi Inc., China
Zhenbo Luo
XiaoMi
Vision Language Model, Computer Vision
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis
Hui Wang
TMCC, College of Computer Science, Nankai University, Tianjin, China
Haoqin Sun
Nankai University
Affective computing, Speech signal processing, Audio understanding
Yong Qin
TMCC, College of Computer Science, Nankai University, Tianjin, China