SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

📅 2025-12-26
🤖 AI Summary
To address the high verification cost and low reliability of autonomous agents in GUI tasks, this paper proposes an active, in-situ self-verification paradigm: agents concurrently capture refined, decisive screenshot evidence during execution—replacing costly post-hoc analysis of lengthy interaction traces. We introduce a novel self-verifying agent architecture guided by the 3C principles (Completeness, Conciseness, Creativity), jointly optimizing task solving and evidence generation. The architecture integrates accessibility-aware UI parsing, LLM-driven evidence cropping, lightweight snapshot generation, and a general-purpose LLM-as-a-Judge verification interface. Evaluated on multi-scale mobile GUI tasks, our method achieves performance gains of +26.08% (with 8B LLMs) and +16.66% (with 30B LLMs), matching the effectiveness of state-of-the-art models such as DeepSeek-V3.1 and Qwen3-235B-A22B, while decisively overcoming the limitations of conventional passive verification.
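The core loop described above — an agent that both acts and captures decisive evidence in-situ, with a judge that sees only the curated snapshots rather than the full trace — can be sketched as follows. This is a minimal, hypothetical illustration: `Snapshot`, `SelfVerifyingAgent`, and the toy rule-based `judge` are stand-ins invented for this sketch, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    """A cropped screenshot captured in-situ as task evidence (hypothetical)."""
    step: int
    region: tuple  # (x, y, w, h) crop around the decisive UI element
    label: str     # what this snapshot is meant to prove

@dataclass
class SelfVerifyingAgent:
    """Sketch of the dual mission: complete the task AND collect evidence."""
    evidence: list = field(default_factory=list)

    def act(self, step: int, ui_state: dict) -> None:
        # ... perform the GUI action for this step (elided) ...
        # Capture evidence only at decisive moments (Conciseness):
        if ui_state.get("decisive"):
            self.evidence.append(
                Snapshot(step=step,
                         region=ui_state["crop"],
                         label=ui_state["label"])
            )

def judge(evidence: list, task: str) -> bool:
    """Stand-in for the LLM-as-a-Judge: it receives ONLY the curated
    snapshots, never the verbose interaction trajectory."""
    return len(evidence) > 0 and any(task in s.label for s in evidence)

# Toy run: three steps, only step 1 is decisive.
agent = SelfVerifyingAgent()
states = [
    {"decisive": False},
    {"decisive": True, "crop": (10, 40, 200, 60), "label": "order confirmed"},
    {"decisive": False},
]
for i, st in enumerate(states):
    agent.act(i, st)

print(len(agent.evidence))                       # 1 — minimal evidence set
print(judge(agent.evidence, "order confirmed"))  # True
```

The key design point the sketch mirrors is the shift in the verifier's input: the judge's cost and reliability no longer depend on trajectory length, only on the handful of snapshots the agent chose to keep.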

📝 Abstract
Agentic reinforcement learning (RL) holds great promise for developing autonomous agents on complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine whether the agent succeeds. Processing such verbose context, laden with irrelevant, noisy history, strains the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: not only to complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.
Problem

Research questions and friction points this paper is trying to address.

Passive, post-hoc verification of full interaction trajectories is costly and unreliable
Verbose, noisy traces strain rule-based scripts, reward models, and LLM-as-a-Judge verifiers
This verification bottleneck limits the scalability of agentic RL for GUI tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive, in-situ self-verification replaces passive post-hoc trajectory analysis
Agent curates a minimal, decisive snapshot set guided by the 3C principles (Completeness, Conciseness, Creativity)
Curated evidence alone is fed to a general LLM-as-a-Judge, enabling scalable agent training
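Two of the 3C principles lend themselves to a concrete sketch: Completeness (every subgoal must be evidenced) and Conciseness (keep as few snapshots as possible), which together resemble greedy set cover over candidate snapshots. The helper below, `select_evidence`, and its candidate names are hypothetical illustrations, not the paper's algorithm; Creativity — deciding *which* UI regions to crop in the first place — is left to the LLM and not modeled here.

```python
def select_evidence(candidates: dict, subgoals: list) -> tuple:
    """Greedy sketch of 3C-guided evidence curation (hypothetical helper).
    Completeness: every subgoal must be covered by some snapshot.
    Conciseness:  prefer the fewest snapshots (greedy set cover).
    `candidates` maps snapshot id -> set of subgoals it demonstrates."""
    uncovered = set(subgoals)
    chosen = []
    while uncovered:
        # Pick the candidate covering the most still-uncovered subgoals.
        best = max(candidates, key=lambda c: len(candidates[c] & uncovered))
        if not candidates[best] & uncovered:
            break  # some subgoals have no evidence at all: incomplete
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, not uncovered  # (selected snapshots, completeness flag)

# Toy candidates: one snapshot proves two subgoals at once.
candidates = {
    "snap_login": {"logged_in"},
    "snap_cart":  {"item_added"},
    "snap_final": {"item_added", "order_placed"},
}
ids, complete = select_evidence(
    candidates, ["logged_in", "item_added", "order_placed"]
)
print(ids)       # ['snap_final', 'snap_login'] — snap_cart is redundant
print(complete)  # True
```

Note that the greedy choice drops `snap_cart` entirely: `snap_final` already proves `item_added`, so keeping both would violate Conciseness without improving Completeness.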