UI-Venus Technical Report: Building High-performance UI Agents with RFT

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses end-to-end UI grounding and navigation from raw screenshots. We propose UI-Venus, a native UI agent built on the Qwen2.5-VL multimodal foundation model that takes only screenshots as input and is trained with reinforcement fine-tuning (RFT), using reward functions designed separately for grounding and navigation together with matching data cleaning strategies. For navigation, a Self-Evolving Trajectory History Alignment and Sparse Action Enhancement framework refines historical reasoning traces and balances the distribution of sparse but critical actions, which improves planning coherence and generalization. On the Screenspot-V2 / Pro benchmarks, the 7B variant reaches 94.1% / 50.8% and the 72B variant reaches 95.3% / 61.9%; on AndroidWorld navigation, success rates are 49.1% (7B) and 65.9% (72B). These results exceed the previous SOTA baselines, including open-source GTA1 and closed-source UI-TARS-1.5.

📝 Abstract
We present UI-Venus, a native UI agent, built on a multimodal large language model, that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks, along with corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope will encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
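The grounding numbers quoted above come from click-point benchmarks, where a prediction is typically counted as correct when the predicted point lands inside the ground-truth element's bounding box. The sketch below illustrates that check and how an accuracy figure follows from it. The `Box` type and the function names are hypothetical, and treating the same binary check as an RFT reward signal is an assumption made for illustration; the abstract does not spell out the paper's actual reward functions.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Inclusive bounds: a click on the element's edge still counts as a hit.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def point_in_box_reward(pred_xy: Tuple[float, float], target: Box) -> float:
    """Return 1.0 if the predicted click point is inside the target box, else 0.0.

    Assumption: a binary signal like this could serve as a grounding reward;
    it is not taken from the paper.
    """
    x, y = pred_xy
    return 1.0 if target.contains(x, y) else 0.0


def grounding_accuracy(
    preds: Sequence[Tuple[float, float]], targets: Sequence[Box]
) -> float:
    """Fraction of predictions whose click point lands inside its target box."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box_reward(p, t) for p, t in zip(preds, targets))
    return hits / len(preds)
```

For example, with a target box spanning (0, 0) to (100, 40), a click at (50, 20) scores 1.0 and a click at (150, 20) scores 0.0, so the two together give an accuracy of 0.5.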
Problem

Research questions and friction points this paper is trying to address.

Building high-performance UI agents using only screenshots as input
Achieving SOTA performance in UI grounding and navigation tasks
Improving navigation with self-evolving trajectory history alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

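Uses a multimodal LLM (Qwen2.5-VL) that takes only screenshots as input
Reinforcement fine-tuning (RFT) with task-specific reward functions for SOTA performance
Self-Evolving Trajectory History Alignment and Sparse Action Enhancement for navigation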
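One simple way to picture "balancing the distribution of sparse but critical actions" is to oversample training steps whose action type is rare. The sketch below does that with a minimum-share threshold. It is an illustrative assumption, not the paper's procedure: the function name, the `min_share` parameter, and the action labels used in the example (`click`, `long_press`) are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def rebalance_sparse_actions(
    samples: List[Dict], min_share: float = 0.1, seed: int = 0
) -> List[Dict]:
    """Oversample rare action types (hypothetical sketch, not the paper's method).

    Each sample is a dict with an "action" key. Any action type that occurs
    fewer than min_share * len(samples) times is resampled with replacement
    until it reaches that count. Common action types are left unchanged.
    """
    rng = random.Random(seed)  # fixed seed keeps the resampling reproducible
    counts = Counter(s["action"] for s in samples)
    target = int(min_share * len(samples))

    balanced = list(samples)
    for action, n in counts.items():
        if n < target:
            pool = [s for s in samples if s["action"] == action]
            # Draw the missing number of examples from the rare action's pool.
            balanced.extend(rng.choices(pool, k=target - n))
    return balanced
```

With 18 `click` steps and 2 `long_press` steps and `min_share=0.25`, the threshold is 5, so three extra `long_press` steps are drawn and the result holds 23 samples. Duplicating rare steps is the crudest option; generating new variations of them would be an alternative under the same idea.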
Zhangxuan Gu
Ant Group
computer vision
Zhengwen Zeng
Ant Group
Zhenyu Xu
Texas Tech University
Machine Learning, Program Repair, Large Language Model
Xingran Zhou
Ant Group
Shuheng Shen
Ant Group
Machine Learning, Optimization, Privacy
Yunfei Liu
Ant Group
Beitong Zhou
Huazhong University of Science and Technology
deep learning, computer vision
Changhua Meng
Ant Group
Tianyu Xia
Ant Group
Differential Privacy
Weizhi Chen
Ant Group
Yue Wen
University of Central Florida
Prosthetics, Rehabilitation robotics, Machine learning, Adaptive control, Neural interface
Jingya Dou
Ant Group
Fei Tang
Ant Group
Jinzhen Lin
Ant Group
Yulin Liu
Ant Group
Zhenlin Guo
Ant Group
Yichen Gong
Ant Group
Heng Jia
Zhejiang University
Changlong Gao
Ant Group
Yuan Guo
Ant Group
Yong Deng
Ant Group
Zhenyu Guo
Ant Group
Liang Chen
Ant Group
Weiqiang Wang
Ant Group