🤖 AI Summary
This work addresses the performance degradation in speech recognition and audio understanding caused by noise during multi-task fine-tuning. To mitigate this issue, the authors propose a self-reflective speech agent framework built on an omni-perception architecture, which uses a learnable trust-based decision mechanism to explicitly decide whether to rely on its internal representations or invoke external perception modules. This approach integrates self-reflective decision-making with joint speech-audio modeling to avoid being misled by noisy hypotheses. Experimental results show that the method reduces word error rate by 12.1% across seven OpenASR benchmarks and achieves 77.37% accuracy on audio question answering, with strong F1 scores and substantially improved generalization.
📝 Abstract
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective at preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. On the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines, reducing word error rate by 12.1% across seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, demonstrating robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
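The abstract does not specify how the reflection decision is implemented. As a purely illustrative sketch (not the authors' method), the core idea of a learnable trust gate can be pictured as a binary decision over confidence signals; the function name, the logistic form, and the fixed weights below are all assumptions standing in for learned parameters:

```python
import math

def trust_gate(internal_conf: float, external_conf: float,
               w_int: float = 4.0, w_ext: float = -3.0, bias: float = 0.5) -> str:
    """Toy trust gate: decide whether to keep the model's own hypothesis
    or accept a candidate from an external perception module.

    internal_conf / external_conf are confidence scores in [0, 1].
    The weights are illustrative placeholders for learned parameters.
    """
    # Logistic score: high when the internal hypothesis looks reliable
    # relative to the external candidate.
    z = w_int * internal_conf + w_ext * external_conf + bias
    p_trust_self = 1.0 / (1.0 + math.exp(-z))
    return "keep_internal" if p_trust_self >= 0.5 else "use_external"

# A confident internal hypothesis with a weak external candidate is kept;
# a weak internal hypothesis defers to a confident external module.
print(trust_gate(0.9, 0.2))   # -> keep_internal
print(trust_gate(0.1, 0.95))  # -> use_external
```

In the paper's setting this decision is learned end-to-end rather than hard-coded, which is what lets the same mechanism transfer from ASR hypothesis selection to multiple-choice audio QA.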