🤖 AI Summary
Existing methods for AI-driven GUI automation follow interaction logic that departs from how humans naturally operate graphical interfaces. This paper proposes Blink-Think-Link (BTL), a cognitive framework that explicitly models interaction as three stages (blink, think, link), inspired by human eye-movement scanning and decision-making. The key contributions are: (1) Blink Data Generation, a fully automated pipeline for annotating blink data in GUI interaction trajectories; (2) BTL Reward, described as the first rule-based reward mechanism that drives reinforcement learning with both process and outcome signals; and (3) a unified architecture that combines a multimodal large language model with visual attention detection, cognitive reasoning, and executable command generation. Evaluated on both static GUI understanding and dynamic interaction tasks, the resulting agent, BTL-UI, achieves state-of-the-art performance, supporting the effectiveness of this cognition-aligned approach to GUI agent development.
📝 Abstract
In AI-driven human-GUI interaction automation, rapid advances in multimodal large language models and reinforcement fine-tuning have yielded remarkable progress, yet a fundamental challenge persists: the interaction logic of existing methods deviates significantly from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework that mimics the cognitive process of users interacting with graphical interfaces. The framework decomposes each interaction into three biologically plausible phases: (1) Blink: rapid detection of, and attention to, relevant screen areas, analogous to saccadic eye movements; (2) Think: higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link: generation of executable commands for precise motor control, emulating human action selection mechanisms. We also introduce two technical innovations for the BTL framework: (1) Blink Data Generation, an automated annotation pipeline optimized for blink data; and (2) BTL Reward, the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building on this framework, we develop a GUI agent model named BTL-UI, which achieves consistent state-of-the-art performance on both static GUI understanding and dynamic interaction tasks across comprehensive benchmarks. These results empirically validate the framework's efficacy for developing advanced GUI agents.
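The abstract does not give the exact form of the BTL Reward, so the following is only a minimal sketch of what a rule-based reward driven by both process and outcome might look like. Everything in it is an assumption made for illustration, not the paper's method: the `BTLOutput` structure, the use of IoU to score the Blink region, the 0.5 threshold, and the equal weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class BTLOutput:
    """Hypothetical container for one Blink-Think-Link step."""
    blink_box: Box               # Blink: screen region the agent attends to
    thought: str                 # Think: free-text reasoning
    action: str                  # Link: action type, e.g. "click"
    point: Tuple[float, float]   # Link: coordinates the action targets


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_in_box(p: Tuple[float, float], box: Box) -> bool:
    """True if point p lies inside (or on the edge of) box."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def btl_reward(
    pred: BTLOutput,
    target_box: Box,
    target_action: str,
    iou_threshold: float = 0.5,  # assumed threshold, not from the paper
    w_process: float = 0.5,      # assumed weight, not from the paper
    w_outcome: float = 0.5,      # assumed weight, not from the paper
) -> float:
    """Combine a process rule (Blink, Think) with an outcome rule (Link)."""
    # Process rule: the attended region overlaps the target element
    # and the agent produced non-empty reasoning.
    blink_ok = iou(pred.blink_box, target_box) >= iou_threshold
    think_ok = bool(pred.thought.strip())
    process = 1.0 if (blink_ok and think_ok) else 0.0

    # Outcome rule: the emitted command has the right type and
    # lands on the target element.
    action_ok = pred.action == target_action
    location_ok = point_in_box(pred.point, target_box)
    outcome = 1.0 if (action_ok and location_ok) else 0.0

    return w_process * process + w_outcome * outcome
```

Under these assumptions, a step that attends to the right region but emits the wrong command still earns the process half of the reward, which is one way a process-and-outcome signal could differ from a purely outcome-based one.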