Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation in speech recognition and audio understanding caused by noise during multi-task fine-tuning. To mitigate this, the authors propose a self-reflective speech-agent framework built on an omni-perception architecture: through a learnable trust-based decision mechanism, the model explicitly decides whether to rely on its internal representations or to invoke external perception modules. By integrating self-reflective decision-making with joint speech-audio modeling, the approach avoids noise-induced misguidance. Experiments show that the method reduces word error rate by 12.1% across seven OpenASR benchmarks and achieves 77.37% accuracy on audio question answering, with strong F1 scores and markedly improved generalization.

📝 Abstract
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
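The abstract's core mechanism can be illustrated with a minimal sketch. This is not the paper's implementation: the names (`Hypothesis`, `self_reflective_decode`, `toy_external`) and the fixed threshold are illustrative assumptions, whereas in Speech-Hands the reflection decision is a learnable primitive trained jointly with the model rather than a hand-set cutoff.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float  # internal model's confidence in its own output, in [0, 1]

def self_reflective_decode(hypothesis, external_module, audio, threshold=0.6):
    """Trust-based decision: keep the internal hypothesis when the trust
    estimate clears the threshold; otherwise consult external perception.
    (Here the trust estimate is just the raw confidence; the paper learns it.)"""
    if hypothesis.confidence >= threshold:
        return hypothesis.text, "internal"
    # Low trust: invoke the external perception module instead of being
    # misled by a noisy internal hypothesis.
    return external_module(audio), "external"

def toy_external(audio):
    """Stand-in for an external audio perception module."""
    return "external transcript"

print(self_reflective_decode(Hypothesis("hello world", 0.9), toy_external, None))
print(self_reflective_decode(Hypothesis("hallo wurld", 0.3), toy_external, None))
```

The key design point the sketch captures is that reflection is an explicit, gateable action: the model commits to either its own representation or an external consultation, rather than always fusing both.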
Problem

Research questions and friction points this paper is trying to address.

speech recognition
audio reasoning
self-reflection
omni perception
voice agentic
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-reflection
voice agentic
omni perception
speech recognition
audio reasoning
👥 Authors

Zhen Wan
2nd-year Ph.D. student, Kyoto University
Natural language processing, Information extraction

Chao-Han Huck Yang
Sr. Research Scientist, NVIDIA Research
Robust Speech Recognition, Language Models, Post-Training, Sequence Modeling

Jinchuan Tian
Language Technologies Institute, Carnegie Mellon University
Speech and Language Processing

Hanrong Ye
NVIDIA Research
Multi-task multi-modal models

Ankita Pasad
NVIDIA
Speech and language processing, Machine learning

Szu-wei Fu
NVIDIA

Arushi Goel
Research Scientist, NVIDIA
Computer Vision, Machine Learning, Vision and Language

Ryo Hachiuma
NVIDIA
Computer Vision, Machine Learning

Shizhe Diao
NVIDIA Research
Large Language Models, Natural Language Processing

Kunal Dhawan
Research Scientist, NVIDIA
Machine Learning, Deep Learning, Speech Processing, Natural Language Processing, Multimodal ML

Sreyan Ghosh
Ph.D. in CS at University of Maryland, College Park
AI, Machine Learning, NLP, Speech Recognition

Yusuke Hirota
NVIDIA
Fairness, Natural language processing, Computer vision

Zhehuai Chen
NVIDIA
Speech Recognition, Speech Synthesis, LLM

Rafael Valle
NVIDIA, UC Berkeley, CNMAT
Machine Listening and Improvisation

Ehsan Hosseini Asl
NVIDIA

Chenhui Chu
Kyoto University
Machine Translation, Natural Language Processing, Vision and Language, Speech Processing

Shinji Watanabe
Carnegie Mellon University
Speech recognition, Speech processing, Speech enhancement, Speech translation

Yu-Chiang Frank Wang
National Taiwan University & NVIDIA
Computer Vision, Deep Learning, Machine Learning, Artificial Intelligence

Boris Ginsburg
NVIDIA
Deep Learning, Speech Recognition, Speech Synthesis