GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions

📅 2025-06-10
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the accessibility bottleneck created by the absence of speech-driven GUI agents, this paper introduces the first end-to-end speech-vision multimodal GUI agent framework. The method directly processes raw speech commands and screen screenshots, predicting interface actions across applications through progressive grounding and planning training stages. To mitigate the speech-text modality imbalance inherited from multimodal pretraining, the authors propose a heuristic mixed-instruction training strategy and construct the first high-quality speech-instruction GUI dataset, built with random-timbre TTS synthesis for speaker diversity. Extensive experiments across multiple benchmarks demonstrate robust, strong performance, empirically validating speech as an effective instruction modality for GUI control. Both code and dataset are publicly released to advance research in multimodal human-computer interaction.
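
As a concrete picture of the dataset-construction step, the sketch below renders a text instruction to speech with a randomly chosen voice. The summary does not name the paper's TTS backend, so pyttsx3 serves only as an illustrative stand-in, and the speaking-rate jitter is an added assumption rather than a reported detail.

```python
# Minimal sketch of random-timbre speech-instruction generation,
# assuming pyttsx3 as a stand-in for the paper's (unspecified) TTS model.
# Idea: synthesize each text instruction with a randomly picked voice so
# the trained agent does not overfit to a single speaker.
import random
import pyttsx3

def synthesize_instruction(text: str, out_path: str) -> None:
    """Render one text instruction to speech with a random timbre."""
    engine = pyttsx3.init()
    voices = engine.getProperty("voices")               # available system voices
    engine.setProperty("voice", random.choice(voices).id)
    engine.setProperty("rate", random.randint(150, 210))  # assumed rate jitter
    engine.save_to_file(text, out_path)
    engine.runAndWait()

# Example: convert one entry of an existing text-instruction dataset.
synthesize_instruction("Open the settings menu and enable dark mode.",
                       "instruction_0001.wav")
```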

📝 Abstract
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Confronted with the scarcity of speech-based GUI agent datasets, we initially generated high-quality speech instructions for training by leveraging a random timbre text-to-speech (TTS) model to convert existing text instructions. We then develop GUIRoboTron-Speech's capabilities through progressive grounding and planning training stages. A key contribution is a heuristic mixed-instruction training strategy designed to mitigate the modality imbalance inherent in pre-trained foundation models. Comprehensive experiments on several benchmark datasets validate the robust and superior performance of GUIRoboTron-Speech, demonstrating the significant potential and widespread applicability of speech as an effective instruction modality for driving GUI agents. Our code and datasets are available at https://github.com/GUIRoboTron/GUIRoboTron-Speech.
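
Concretely, an agent of this kind consumes a speech instruction plus a screenshot and emits a textual action that is parsed and then executed on the device. The sketch below illustrates only that parsing step; the action names and string format are assumptions for illustration, not the paper's actual action schema.

```python
# Illustrative sketch of a GUI agent's action-output handling: the model
# emits a string such as 'click(512, 384)' which is parsed into a
# structured command. Action vocabulary and syntax here are hypothetical.
import re
from dataclasses import dataclass

@dataclass
class Action:
    name: str    # e.g. "click", "type"
    args: tuple  # e.g. (x, y) pixel coordinates, or a text payload

_ACTION_RE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)$")

def parse_action(raw: str) -> Action:
    """Parse a model-emitted action string into a structured Action."""
    m = _ACTION_RE.match(raw.strip())
    if m is None:
        raise ValueError(f"unparseable action: {raw!r}")
    name = m.group("name")
    args = tuple(a.strip().strip('"')
                 for a in m.group("args").split(",") if a.strip())
    if name == "click":                 # click arguments are coordinates
        args = tuple(int(a) for a in args)
    return Action(name, args)

print(parse_action("click(512, 384)"))    # Action(name='click', args=(512, 384))
print(parse_action('type("dark mode")'))  # Action(name='type', args=('dark mode',))
```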
Problem

Research questions and friction points this paper is trying to address.

Autonomous GUI agents lack speech instruction support
Scarcity of speech-based GUI agent training datasets
Modality imbalance in pre-trained foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end GUI agent using speech instructions
Random timbre TTS for speech dataset generation
Heuristic mixed-instruction training for modality balance (sketched below)
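
A minimal sketch of the mixed-instruction idea, under the assumption that "mixing" means sampling the instruction modality per training example; the 50/50 ratio is an illustrative default, not the paper's reported schedule.

```python
# Sketch: per-sample modality sampling to balance speech and text
# instructions during training. Field names and the mixing ratio are
# assumptions for illustration.
import random

def pick_instruction(sample: dict, speech_ratio: float = 0.5) -> dict:
    """Attach either the speech or the text rendering of the instruction.

    `sample` is assumed to hold both renderings of the same instruction:
    {"text": str, "speech_wav": str, "screenshot": str, "action": str}.
    """
    if random.random() < speech_ratio:
        instruction = {"modality": "speech", "value": sample["speech_wav"]}
    else:
        instruction = {"modality": "text", "value": sample["text"]}
    return {**sample, "instruction": instruction}

# Example: roughly half of a batch ends up speech-conditioned.
batch = [{"text": "open settings", "speech_wav": "0001.wav",
          "screenshot": "0001.png", "action": "click(512, 384)"}] * 4
mixed = [pick_instruction(s, speech_ratio=0.5) for s in batch]
```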
👥 Authors
Wenkang Han, Zhejiang University (Vision-Language Model; Agentic Intelligence)
Zhixiong Zeng, Meituan
Jing Huang, Meituan
Shu Jiang, Meituan
Liming Zheng, Meituan
Longrong Yang, Zhejiang University (Computer Vision and Pattern Recognition)
Haibo Qiu, University of Sydney (Multimodal LLM; Vision and Language; Computer Vision)
Chang Yao, Zhejiang University
Jingyuan Chen, Zhejiang University
Lin Ma, Meituan