BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

📅 2025-09-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to AI-driven GUI automation interact with interfaces in ways that differ from how people naturally operate them. This paper proposes the Blink-Think-Link (BTL) reasoning model, a cognitive framework that explicitly models a three-stage "blink, think, link" process inspired by human eye movements and decision-making. The key contributions are: (1) Blink Data Generation, an automated annotation pipeline for blink data; (2) BTL Reward, described by the authors as the first rule-based reward mechanism that drives reinforcement learning with both process and outcome signals; and (3) a unified architecture, realized as the BTL-UI agent, that combines a multimodal large language model with attention to relevant screen regions, reasoning, and executable command generation. Evaluated on both static GUI understanding and dynamic interaction tasks, BTL-UI is reported to achieve state-of-the-art performance, which the authors take as empirical support for this cognition-aligned approach to GUI agent development.

📝 Abstract

In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward - the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates consistent state-of-the-art performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI agents.
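The three-phase decomposition described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a model response wraps each phase in `<blink>`, `<think>`, and `<link>` tags; the tag names, the `BTLStep` type, and the `parse_btl` helper are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BTLStep:
    """One interaction step split into the three phases named in the abstract."""
    blink: str  # screen region(s) attended to
    think: str  # reasoning about what to do next
    link: str   # executable command


# Hypothetical tag names, assumed for this sketch only.
_TAGS = ("blink", "think", "link")


def parse_btl(response: str) -> Optional[BTLStep]:
    """Return a BTLStep if all three tagged phases are present, else None."""
    parts = {}
    for tag in _TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match is None:
            return None  # a missing phase means the response is malformed
        parts[tag] = match.group(1).strip()
    return BTLStep(**parts)


example = (
    "<blink>[120, 40, 300, 90]</blink>"
    "<think>The search box is at the top of the screen.</think>"
    "<link>click(210, 65)</link>"
)
step = parse_btl(example)
```

Keeping the phases as separate fields makes it straightforward to inspect or score the attended region, the reasoning, and the final command independently.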

Problem

Research questions and friction points this paper is trying to address.

Addresses the gap between how current GUI agents interact and natural human-GUI interaction patterns

Proposes brain-inspired framework mimicking cognitive processes

Develops automated annotation and reinforcement learning mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Blink-Think-Link brain-inspired cognitive framework

Automated annotation pipeline for blink data

Rule-based reward mechanism for reinforcement learning
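To make the idea of a rule-based reward driven by both process and outcome more concrete, here is a toy sketch. It is not the paper's BTL Reward: the specific rules (IoU between an attended box and a target box as the process term, a click landing inside the target as the outcome term), the equal weights, and all function names are assumptions chosen only for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def toy_reward(
    format_ok: bool,
    blink_box: Box,
    target_box: Box,
    click_xy: Tuple[float, float],
    w_process: float = 0.5,  # hypothetical weight
    w_outcome: float = 0.5,  # hypothetical weight
) -> float:
    """Combine a process term and an outcome term using fixed rules."""
    if not format_ok:
        return 0.0  # malformed responses earn nothing in this sketch
    # Process term: did the model attend to the right region?
    process = iou(blink_box, target_box)
    # Outcome term: did the final click land inside the target element?
    x, y = click_xy
    inside = target_box[0] <= x <= target_box[2] and target_box[1] <= y <= target_box[3]
    outcome = 1.0 if inside else 0.0
    return w_process * process + w_outcome * outcome
```

Because every term is computed from fixed rules rather than a learned reward model, a signal of this kind is cheap to evaluate and easy to audit, which is the usual appeal of rule-based rewards in reinforcement fine-tuning.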

🔎 Similar Papers

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents