Faithful Mobile GUI Agents with Guided Advantage Estimator

📅 2026-05-01
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the tendency of existing vision-language GUI agents to rely on memorized shortcuts, which leads to decisions that are unfaithful to on-screen content or user instructions. To mitigate this, the authors propose Faithful-Agent, a framework that prioritizes faithfulness through a dedicated two-stage training paradigm. It first introduces a refusal mechanism via supervised fine-tuning, then applies GRPO-based reinforcement fine-tuning augmented with a newly designed Guided Advantage Estimator (GuAE) to alleviate advantage collapse under sparse rewards and low-variance trajectories. A thought-action consistency reward is also incorporated to enhance behavioral reliability. The approach raises the Trap success rate from 13.88% to 80.21% while preserving strong general instruction-following capabilities.
📝 Abstract
Vision-language model based graphical user interface (GUI) agents have shown strong interaction capabilities. However, they often behave unfaithfully, relying on memorized shortcuts rather than grounding actions in displayed screen evidence or user instructions. To address this, we propose Faithful-Agent, a faithfulness-first framework that reformulates GUI interaction to prioritize evidence groundedness and internal consistency. Faithful-Agent employs a two-stage pipeline: (i) a faithfulness-oriented SFT stage to instill abstainment behaviors under evidence perturbations; (ii) an RFT stage that further amplifies faithfulness by introducing the guided advantage estimator (GuAE), an anchor-based and variance-adaptive advantage tempering mechanism built upon GRPO. GuAE prevents advantage collapse in low-variance rollout groups under sparse GUI rewards, and with a thought-action consistency reward, Faithful-Agent (Stage II) elevates the Trap SR from 13.88% to 80.21% relative to the baseline, while preserving robust general instruction-following performance.
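To make the "advantage collapse" mentioned in the abstract concrete, the following is a minimal sketch of the standard GRPO group-relative advantage, not of GuAE itself (the abstract does not give GuAE's formula). The function name `grpo_advantages`, the `eps` value, and the use of the population standard deviation are illustrative assumptions. When every rollout in a group receives the same sparse reward, each advantage becomes zero and the group contributes no learning signal.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: (r - group mean) / (group std + eps).

    Assumption: population std and eps=1e-6 are illustrative choices;
    actual implementations may differ in both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with one successful rollout: the success gets a positive
# advantage, the failures get negative ones.
mixed = grpo_advantages([1.0, 0.0, 0.0, 0.0])

# A low-variance group under a sparse reward: every rollout failed,
# so every advantage is zero (the collapse GuAE is meant to prevent).
collapsed = grpo_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(collapsed)
```

The same collapse occurs when all rollouts succeed, since the numerator `r - mu` is zero for every sample regardless of the reward's value.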
Problem

Research questions and friction points this paper is trying to address.

GUI agents
faithfulness
evidence grounding
instruction following
visual-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

faithful GUI agent
guided advantage estimator
GRPO
evidence groundedness
thought-action consistency
🔎 Similar Papers