Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agents predominantly rely on trial-and-error decision-making, lacking progressive reasoning and continual learning capabilities; moreover, current evaluation benchmarks are oversimplified and fail to capture real-world interaction complexity. To address these limitations, we propose CogniGUI—a cognitive framework grounded in Kahneman’s dual-system theory—integrating rapid perceptual processing (System 1) with deliberate, stepwise reasoning (System 2) within an iterative “explore–learn–master” paradigm. We introduce Omni Parser for fine-grained visual-semantic parsing of GUI elements and design Group-based Relative Policy Optimization (GRPO), a novel reinforcement learning algorithm that leverages relative rewards to evaluate multi-path interactions. Additionally, we release ScreenSeek, a challenging new benchmark featuring cross-application navigation and interface consistency—key challenges in practical GUI interaction. Experiments demonstrate that CogniGUI significantly outperforms state-of-the-art methods on multiple GUI localization tasks and ScreenSeek, achieving substantial gains in generalization and dynamic adaptability.

📝 Abstract
Graphical User Interface (GUI) agents have made significant progress in automating digital tasks using computer vision and language models. Nevertheless, existing agent systems face notable limitations. First, they depend predominantly on trial-and-error decision making rather than progressive reasoning, and so lack the ability to learn and adapt from interactive experience. Second, these systems are assessed with overly simplistic single-step accuracy metrics, which fail to reflect the intricate nature of real-world GUI interactions. In this paper, we present CogniGUI, a cognitive framework designed to overcome these limitations by enabling adaptive, human-like learning for GUI automation. Inspired by Kahneman's dual-process theory, our approach combines two main components: (1) an omni parser engine that performs immediate hierarchical parsing of GUI elements through quick visual-semantic analysis to identify actionable components, and (2) a Group-based Relative Policy Optimization (GRPO) grounding agent that evaluates multiple interaction paths using a relative reward system, promoting minimal and efficient operational routes. This dual-system design supports iterative "exploration–learning–mastery" cycles, enabling the agent to refine its strategies over time from accumulated experience. Moreover, to assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark covering multi-application navigation, dynamic state transitions, and cross-interface coherence, challenges often overlooked by current benchmarks. Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods on both existing GUI grounding benchmarks and our newly proposed benchmark.
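The core of the GRPO grounding agent described above is a relative reward: each sampled interaction path is scored against the other paths in its group rather than against an absolute baseline. A minimal sketch of that group-relative scoring follows; the step-penalty reward shaping and function names are illustrative assumptions, not the paper's implementation:

```python
import statistics

def group_relative_advantages(rewards):
    """Score each sampled interaction path relative to its group.

    Advantage = (reward - group mean) / group std, so paths that beat
    their siblings are reinforced without a learned value baseline.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four candidate paths for the same GUI task, rewarded by task
# success minus a small per-action penalty (values are illustrative),
# so shorter successful paths earn higher advantages.
rewards = [1.0 - 0.05 * steps for steps in (4, 6, 9, 12)]
advantages = group_relative_advantages(rewards)
```

Because advantages are centered within each group, they sum to zero: the shortest path gets the largest positive advantage, nudging the policy toward minimal operational routes.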
Problem

Research questions and friction points this paper is trying to address.

Enabling adaptive learning for human-like GUI automation
Overcoming trial-and-error limitations with progressive reasoning
Introducing comprehensive benchmarks for GUI agent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni parser engine for quick visual semantic analysis
GRPO agent with relative reward system optimization
ScreenSeek benchmark for multi-application navigation testing
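To make the omni parser's role concrete, here is a minimal sketch of the kind of hierarchical, actionable-element representation such an engine might emit for one screen. The data shapes, field names, and the toy parse tree are illustrative assumptions, not the paper's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class GUIElement:
    role: str                 # e.g. "button", "text_field", "menu"
    label: str
    bbox: tuple               # (x1, y1, x2, y2) in screen pixels
    actionable: bool = False
    children: list = field(default_factory=list)

def actionable_elements(root):
    """Flatten the parsed hierarchy into its actionable components."""
    found = [root] if root.actionable else []
    for child in root.children:
        found.extend(actionable_elements(child))
    return found

# Toy parse tree standing in for the parser's output on a settings screen.
screen = GUIElement("window", "Settings", (0, 0, 800, 600), children=[
    GUIElement("menu", "Sidebar", (0, 0, 200, 600), children=[
        GUIElement("button", "Wi-Fi", (10, 40, 190, 70), actionable=True),
        GUIElement("button", "Display", (10, 80, 190, 110), actionable=True),
    ]),
    GUIElement("text_field", "Search", (220, 10, 780, 40), actionable=True),
])

targets = actionable_elements(screen)
```

Grounding then reduces to matching an instruction against the labels and bounding boxes of `targets`, which is what makes a fast System 1 pass possible before the slower reasoning step.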
👥 Authors
Jinjie Wei, Fudan University (Large Language Model)
Jiyao Liu, Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University
Lihao Liu, Amazon (LLM-based Agent, Healthcare AI)
Ming Hu, Shanghai Artificial Intelligence Laboratory
Junzhi Ning, Shanghai Artificial Intelligence Laboratory
Mingcheng Li, Fudan University
Weijie Yin, ByteDance (Vision Language Model, Deep Learning, AI4S)
Junjun He, Shanghai Jiao Tong University
Xiao Liang, ByteDance Douyin Content Group
Chao Feng, University of Zurich (network, machine learning, cybersecurity)
Dingkang Yang, ByteDance (Multimodal Learning, Generative AI, Embodied AI)