🤖 AI Summary
This work addresses the performance degradation in speech recognition and audio understanding caused by noise during multi-task fine-tuning. To mitigate this issue, the authors propose a self-reflective speech agent framework built on an omni-perception architecture, which uses a learnable trust-based decision mechanism to explicitly decide whether to rely on its internal representations or invoke external perception modules. This approach integrates self-reflective decision-making with joint speech-audio modeling to avoid being misled by noisy hypotheses. Experimental results show that the method reduces word error rate by 12.1% across seven OpenASR benchmarks and achieves 77.37% accuracy on audio question answering, with strong F1 scores and substantially improved generalization.
📝 Abstract
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective at preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. On the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines, reducing word error rate by 12.1% across seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, demonstrating robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
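The abstract does not specify how the reflection decision is implemented. As a purely illustrative sketch (not the authors' method), the core idea of a learnable trust gate can be pictured as a binary decision over confidence signals; the function name, the logistic form, and the fixed weights below are all assumptions standing in for learned parameters:

```python
import math

def trust_gate(internal_conf: float, external_conf: float,
               w_int: float = 4.0, w_ext: float = -3.0, bias: float = 0.5) -> str:
    """Toy trust gate: decide whether to keep the model's own hypothesis
    or accept a candidate from an external perception module.

    internal_conf / external_conf are confidence scores in [0, 1].
    The weights are illustrative placeholders for learned parameters.
    """
    # Logistic score: high when the internal hypothesis looks reliable
    # relative to the external candidate.
    z = w_int * internal_conf + w_ext * external_conf + bias
    p_trust_self = 1.0 / (1.0 + math.exp(-z))
    return "keep_internal" if p_trust_self >= 0.5 else "use_external"

# A confident internal hypothesis with a weak external candidate is kept;
# a weak internal hypothesis defers to a confident external module.
print(trust_gate(0.9, 0.2))   # -> keep_internal
print(trust_gate(0.1, 0.95))  # -> use_external
```

In the paper's setting this decision is learned end-to-end rather than hard-coded, which is what lets the same mechanism transfer from ASR hypothesis selection to multiple-choice audio QA.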