🤖 AI Summary
This work addresses end-to-end UI grounding and navigation from raw screenshots, long-standing bottlenecks in joint visual-action modeling. The authors propose UI-Venus, a high-performance, screenshot-only UI agent for precise UI element localization and sequential navigation. To improve visual-action alignment, they introduce Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, combined with reinforcement fine-tuning (RFT), carefully designed reward functions, and efficient data cleaning, all built on the Qwen2.5-VL multimodal foundation model. The approach improves planning coherence and cross-task generalization. On the Screenspot-V2/Pro grounding benchmarks, the 7B and 72B variants of UI-Venus achieve 94.1%/50.8% and 95.3%/61.9% accuracy, respectively; on the AndroidWorld navigation arena, their success rates reach 49.1% and 65.9%. These results surpass existing open- and closed-source methods, demonstrating state-of-the-art performance on complex, real-world UI interaction tasks.
📝 Abstract
We present UI-Venus, a native UI agent that takes only screenshots as input, built on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing previous SOTA baselines including the open-source GTA1 and the closed-source UI-TARS-1.5. To demonstrate UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks, along with corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope will encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
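The abstract mentions carefully designed reward functions for UI grounding under RFT but does not spell them out. A minimal sketch of one common formulation for grounding rewards, a rule-based point-in-box check, is shown below; the function name, signature, and exact rule are our assumptions for illustration, not the paper's actual design:

```python
def grounding_reward(pred_point, gt_bbox):
    """Hypothetical rule-based grounding reward: 1.0 if the predicted
    click point lands inside the ground-truth bounding box, else 0.0.

    pred_point: (x, y) pixel coordinates predicted by the model
    gt_bbox:    (x1, y1, x2, y2) ground-truth element box
    """
    x, y = pred_point
    x1, y1, x2, y2 = gt_bbox
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
```

In RFT pipelines such a verifiable reward is typically combined with a format check on the model's output and averaged over sampled rollouts to compute policy-gradient advantages.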