🤖 AI Summary
To address the accessibility bottleneck caused by the absence of speech-driven GUI agents, this paper introduces the first end-to-end speech–vision multimodal GUI agent framework. The method directly processes raw speech commands and screen screenshots, predicting cross-application interface actions via progressive vision–language alignment, multi-stage grounding, and planning modeling. To mitigate the speech–text modality imbalance inherited from multimodal pretraining, the authors propose a hybrid instruction training strategy and construct the first high-quality speech-command GUI dataset, generated with randomized-timbre text-to-speech (TTS) synthesis. Extensive experiments demonstrate significant performance gains over text-based baselines across multiple benchmarks, empirically validating speech as an effective modality for GUI control. Both code and dataset are publicly released to advance research in multimodal human–computer interaction.
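The dataset-construction step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `synthesize_speech` and the speaker pool are hypothetical placeholders standing in for a real multi-speaker TTS model, and the dummy waveform merely marks where real audio would go.

```python
import random

def synthesize_speech(text: str, speaker_id: int) -> list[float]:
    """Hypothetical stand-in for a multi-speaker TTS model: returns a
    fake waveform (list of floats) deterministically keyed by the
    instruction text and speaker id."""
    rng = random.Random(f"{speaker_id}:{text}")
    return [rng.uniform(-1.0, 1.0) for _ in range(16)]  # dummy audio samples

def build_speech_dataset(text_instructions, num_speakers=100, seed=0):
    """Convert each text instruction into a speech command using a
    randomly drawn speaker timbre, mirroring the random-timbre TTS
    conversion described in the summary."""
    rng = random.Random(seed)
    dataset = []
    for text in text_instructions:
        speaker = rng.randrange(num_speakers)  # randomized timbre per sample
        audio = synthesize_speech(text, speaker)
        dataset.append({"text": text, "speaker": speaker, "audio": audio})
    return dataset

samples = build_speech_dataset(["click the settings icon", "scroll down"])
```

Drawing a fresh speaker per instruction (rather than one fixed voice) is what gives the resulting dataset timbre diversity, so the trained agent does not overfit to a single synthetic speaker.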
📝 Abstract
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Confronted with the scarcity of speech-based GUI agent datasets, we first generate high-quality speech instructions for training by leveraging a random-timbre text-to-speech (TTS) model to convert existing text instructions. We then develop GUIRoboTron-Speech's capabilities through progressive grounding and planning training stages. A key contribution is a heuristic mixed-instruction training strategy designed to mitigate the modality imbalance inherent in pre-trained foundation models. Comprehensive experiments on several benchmark datasets validate the robust and superior performance of GUIRoboTron-Speech, demonstrating the significant potential and widespread applicability of speech as an effective instruction modality for driving GUI agents. Our code and datasets are available at https://github.com/GUIRoboTron/GUIRoboTron-Speech.
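The mixed-instruction strategy named in the abstract could take the minimal form below: each training example randomly uses either the speech or the text version of its instruction, so the model keeps seeing both modalities. This is a sketch under assumptions, not the paper's heuristic: the 50/50 mixing ratio `p_speech` and the data layout are illustrative choices.

```python
import random

def mixed_instruction_examples(pairs, p_speech=0.5, seed=0):
    """Yield training examples that randomly present either the speech
    or the text form of each instruction. The fixed p_speech ratio is
    an assumption; the paper's heuristic strategy may weight or
    schedule the modalities differently."""
    rng = random.Random(seed)
    for text, speech in pairs:
        if rng.random() < p_speech:
            yield {"modality": "speech", "instruction": speech}
        else:
            yield {"modality": "text", "instruction": text}

# Toy (text, speech) pairs; the bytes stand in for synthesized audio.
pairs = [
    ("open mail", b"<speech-0>"),
    ("tap send", b"<speech-1>"),
    ("go back", b"<speech-2>"),
    ("zoom in", b"<speech-3>"),
]
examples = list(mixed_instruction_examples(pairs))
```

Keeping text instructions in the mix during speech training is one plausible way to counter the modality imbalance of a text-heavy pre-trained foundation model, since the agent is never fine-tuned on speech alone.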