PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing GUI agents, which are predominantly reactive and struggle to proactively infer user intent without explicit instructions. To bridge this gap, we introduce PIRA-Bench, the first benchmark specifically designed for evaluating proactive GUI agents, featuring complex interaction trajectories that interweave multiple intents and rich user context. We further propose PIRF, a baseline framework that integrates multimodal large language models with memory-aware mechanisms and state-tracking techniques to effectively manage noise in continuous visual inputs and coordinate multiple task threads. Experimental results demonstrate that PIRF significantly outperforms general-purpose multimodal models on PIRA-Bench, exhibiting superior performance in proactive intent recognition and recommendation tasks.

📝 Abstract
Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm: a user must provide an explicit instruction for the agent to execute a task. However, an intelligent AI assistant should be proactive: capable of anticipating user intentions directly from continuous visual inputs, such as mobile or desktop screenshots, and of offering timely recommendations without explicit user prompting. Transitioning to this proactive paradigm presents significant challenges. Real-world screen activity is rarely linear; it consists of long-horizon trajectories fraught with noisy browsing, meaningless actions, and multithreaded task-switching. To address this gap, we introduce PIRA-Bench (Proactive Intent Recommendation Agent Benchmark), a novel benchmark for evaluating multimodal large language models (MLLMs) on continuous, weakly supervised visual inputs. Unlike reactive datasets, PIRA-Bench features complex trajectories with multiple interleaved intents and noisy segments set against varied user profile contexts, challenging agents to detect actionable events while adapting to user preferences. Furthermore, we propose the PIRF baseline, a memory-aware, state-tracking framework that empowers general MLLMs to manage multiple task threads and handle misleading visual inputs. PIRA-Bench serves as an initial step toward robust and proactive GUI-based personal assistants.
Problem

Research questions and friction points this paper is trying to address.

proactive intent recommendation
GUI agents
multimodal large language models
user intention anticipation
continuous visual inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive Intent Recommendation
GUI Agents
Multimodal Large Language Models
PIRA-Bench
State Tracking
Yuxiang Chai
The Chinese University of Hong Kong
Computer Vision · LLM · Agent
Shunye Tang
Nankai University
Han Xiao
MMLab CUHK
Computer Vision · Machine Learning
Rui Liu
Huawei Research
Hongsheng Li
MMLab @ CUHK