AFRAgent: An Adaptive Feature Renormalization Based High-Resolution-Aware GUI Agent

📅 2025-11-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models suffer from spatial detail loss and low inference efficiency in mobile UI automation. To address these challenges, we propose a lightweight vision-language framework with two key components: (1) token-level adaptive feature renormalization (AFRN), which employs affine transformations to fuse high-resolution spatial details into low-resolution visual embeddings, significantly improving spatial fidelity; and (2) end-to-end joint modeling of GUI visual representations and human interaction behaviors (e.g., tapping, text input) built upon the InstructBLIP architecture. Our model contains less than one-quarter the parameters of state-of-the-art (SOTA) competitors and achieves substantially reduced inference latency. On the Meta-GUI and AITW benchmarks, it attains SOTA accuracy in GUI element recognition and task completion rate, demonstrating a synergistic improvement in both precision and efficiency.

📝 Abstract
There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of vision-language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an InstructBLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings and fuses in high-resolution details. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
Problem

Research questions and friction points this paper is trying to address.

Enhances widget identification in GUI automation
Reduces model size for faster inference
Improves low-resolution image embedding quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive feature renormalization enriches low-resolution image embeddings
Fuses high-resolution details via token-level affine transformation
Compact multimodal architecture outperforms larger competitors in GUI automation
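The core idea in the bullets above, a token-level affine renormalization that injects high-resolution detail into low-resolution visual tokens, can be sketched in a few lines. This is a minimal FiLM-style illustration, not the paper's actual implementation: the function name, the linear heads predicting the per-token scale and shift, and the assumption that high-resolution features are already pooled to one vector per token are all hypothetical.

```python
import numpy as np

def adaptive_feature_renorm(low_res_tokens, high_res_feats,
                            w_gamma, b_gamma, w_beta, b_beta):
    """Token-level affine renormalization (hypothetical sketch).

    low_res_tokens: (T, D) visual token embeddings from the low-res pathway.
    high_res_feats: (T, D) high-res features, assumed pooled so each of the
                    T low-res tokens has one aligned high-res vector.
    For each token t: out_t = gamma_t * low_t + beta_t, where gamma_t and
    beta_t are predicted from the high-res features by small linear heads.
    """
    gamma = high_res_feats @ w_gamma + b_gamma  # (T, D) per-token scale
    beta = high_res_feats @ w_beta + b_beta     # (T, D) per-token shift
    return gamma * low_res_tokens + beta

# Usage: with the heads initialized to identity (gamma=1, beta=0), the
# renormalization leaves the low-res tokens unchanged, so training can
# start from the unmodified embeddings and learn the fusion gradually.
T, D = 4, 8
rng = np.random.default_rng(0)
low = rng.standard_normal((T, D))
hi = rng.standard_normal((T, D))
out = adaptive_feature_renorm(low, hi,
                              np.zeros((D, D)), np.ones(D),
                              np.zeros((D, D)), np.zeros(D))
```

Identity initialization of such modulation heads is a common stabilization trick for feature-fusion layers; whether AFRAgent uses it is not stated in this summary.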