🤖 AI Summary
Existing GUI agents struggle to comprehend screen visual elements (e.g., icons, buttons), limiting their automation capability. This paper introduces CogAgent, an 18B-parameter vision-language model specifically designed for GUI understanding and navigation, capable of cross-platform (PC/Android) interactive task execution from raw screenshots alone. Key contributions include: (i) the first dual-resolution image encoder supporting high-fidelity 1120×1120 inputs; (ii) the first demonstration that purely visual input surpasses HTML-text-augmented LLMs on GUI navigation tasks; and (iii) end-to-end modeling of GUI action prediction, refined via large-scale GUI instruction tuning. CogAgent achieves state-of-the-art performance across nine VQA benchmarks and significantly outperforms prior methods on Mind2Web and AITW navigation benchmarks. The model and code are publicly released.
📝 Abstract
People spend an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks, Mind2Web and AITW, advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.