RWKV-UI: UI Understanding with Enhanced Perception and Reasoning

📅 2025-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models face two challenges in high-resolution web interface understanding: loss of layout information and insufficient multi-step interactive reasoning capability. To address these challenges, we propose a framework that jointly enhances structural awareness and logical reasoning. First, we introduce a layout-detection visual prompt to explicitly model the spatial structure of UI elements. Second, we design a Chain-of-Thought (CoT)-driven visual reasoning chain prompt to support sequential action decision-making. Third, we build a lightweight multimodal model based on the RWKV architecture, integrating layout-detection pretraining, CoT-based prompt engineering, and end-to-end joint fine-tuning. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in both UI layout understanding accuracy and multi-step interactive reasoning success rate. The framework offers improved interpretability, scalability, and generalizability for complex interface understanding.

📝 Abstract

Existing Visual Language Models often struggle with information loss and limited reasoning abilities when handling high-resolution web interfaces that combine complex visual, textual, and interactive elements. These challenges are particularly evident in tasks requiring webpage layout comprehension and multi-step interactive reasoning. To address them, we propose RWKV-UI, a Visual Language Model based on the RWKV architecture and specifically designed to handle high-resolution UI images. During model training, we introduce layout detection as a visual prompt to help the model better understand webpage layout structures. Additionally, we design a visual prompt based on the Chain-of-Thought (CoT) mechanism, which enhances the model's ability to understand and reason about webpage content through reasoning chains. Experimental results show that RWKV-UI demonstrates significant performance improvements in high-resolution UI understanding and interactive reasoning tasks.
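The abstract does not specify how the layout-detection or CoT prompts are constructed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a grayscale image stored as a nested list and layout regions already produced by some detector; `LayoutBox`, `overlay_layout_prompt`, and `build_reasoning_prompt` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayoutBox:
    """Hypothetical detected UI region, in inclusive pixel coordinates."""
    label: str
    x0: int
    y0: int
    x1: int
    y1: int


def overlay_layout_prompt(
    image: List[List[int]], boxes: List[LayoutBox], marker: int = 255
) -> List[List[int]]:
    """Return a copy of a grayscale image with each box outline drawn on it.

    Assumption: the "visual prompt" is rendered as outlines on the pixels.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [row[:] for row in image]  # leave the input image untouched
    for box in boxes:
        # Clamp the box to the image; skip boxes that fall entirely outside.
        x0, x1 = max(0, box.x0), min(width - 1, box.x1)
        y0, y1 = max(0, box.y0), min(height - 1, box.y1)
        if x0 > x1 or y0 > y1:
            continue
        for x in range(x0, x1 + 1):  # top and bottom edges
            out[y0][x] = marker
            out[y1][x] = marker
        for y in range(y0, y1 + 1):  # left and right edges
            out[y][x0] = marker
            out[y][x1] = marker
    return out


def build_reasoning_prompt(boxes: List[LayoutBox], task: str) -> str:
    """Compose a step-by-step text prompt that refers to the detected regions.

    Assumption: the reasoning chain is elicited with a plain-text instruction.
    """
    lines = [f"Task: {task}", "Detected layout regions:"]
    for i, box in enumerate(boxes, start=1):
        lines.append(
            f"  {i}. {box.label} at ({box.x0}, {box.y0})-({box.x1}, {box.y1})"
        )
    lines.append(
        "Reason step by step about which region to act on, "
        "then give the next action."
    )
    return "\n".join(lines)
```

In this sketch the annotated image and the text prompt would be passed together to a vision-language model; how RWKV-UI actually encodes and consumes these prompts is described in the paper itself, not here.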

Problem

Research questions and friction points this paper is trying to address.

Weak UI layout comprehension in existing Visual Language Models

Limited multi-step interactive reasoning

Information loss in high-resolution UIs

Innovation

Methods, ideas, or system contributions that make the work stand out.

RWKV architecture for UI images

Layout detection as visual prompt

Chain-of-Thought for reasoning enhancement

🔎 Similar Papers

Visual grounding for desktop graphical user interfaces