See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles

📅 2025-09-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal GUI agents often execute toggle instructions unreliably: when a toggle is already in the requested state, they fail to recognize this and perform a redundant operation anyway. To address this, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current binary state (on/off) of a toggle, infer the desired state from the instruction, and act accordingly, so that the agent can judge whether a state transition is actually required. Agents are evaluated on a state control benchmark of binary toggle instructions built from public datasets. On three multimodal agents, StaR improves toggle instruction execution accuracy by over 30%. It also improves general task performance on three public benchmarks and shows potential for real-world use in a dynamic environment.

📝 Abstract

The advent of multimodal agents facilitates effective interaction with graphical user interfaces (GUIs), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address this challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations in a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
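The perceive, analyze, act sequence described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: StaR is a training method for multimodal agents, whereas the code below only mirrors the decision logic such an agent is meant to learn. The names `ToggleState`, `desired_state_from_instruction`, and `decide_action`, the keyword lists, and the action labels "click" and "complete" are all hypothetical and chosen for illustration.

```python
from enum import Enum


class ToggleState(Enum):
    ON = "on"
    OFF = "off"


def desired_state_from_instruction(instruction: str) -> ToggleState:
    """Toy keyword parser (an assumption): a real agent infers this with a model."""
    text = instruction.lower()
    off_phrases = ("turn off", "switch off", "disable")
    on_phrases = ("turn on", "switch on", "enable")
    if any(phrase in text for phrase in off_phrases):
        return ToggleState.OFF
    if any(phrase in text for phrase in on_phrases):
        return ToggleState.ON
    raise ValueError(f"Cannot infer a desired toggle state from: {instruction!r}")


def decide_action(current: ToggleState, instruction: str) -> str:
    """Return a hypothetical action label: click only when a state change is needed."""
    desired = desired_state_from_instruction(instruction)
    if current == desired:
        # The toggle already matches the instruction, so clicking would undo it.
        return "complete"
    return "click"


if __name__ == "__main__":
    print(decide_action(ToggleState.ON, "Turn on Wi-Fi"))
    print(decide_action(ToggleState.OFF, "Turn on Wi-Fi"))
```

The first call covers the case the abstract singles out as the main failure mode: the current state already matches the desired state, so the correct behavior is to finish without clicking.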

Problem

Research questions and friction points this paper is trying to address.

Addresses unreliable toggle control in GUI agents

Proposes state-aware reasoning for toggle state perception

Improves multimodal agent accuracy in instruction execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

State-aware Reasoning for toggle state perception

Training method for multimodal GUI agents

Improves toggle instruction execution accuracy

🔎 Similar Papers

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents