🤖 AI Summary
Multimodal large language models (MLLMs) driving pure-vision GUI agents remain weak at semantic alignment in GUI grounding: inefficient exploration during reinforcement learning prevents them from learning difficult associations between instructions and the functionally correct UI elements.
Method: The paper proposes Adaptive Exploration Policy Optimization (AEPO), which enforces broader exploration through a multi-answer generation strategy and guides it with an Adaptive Exploration Reward (AER) derived from first principles of efficiency (η = U/C). AEPO plugs into Reinforcement Learning with Verifiable Rewards (RLVR) to jointly improve semantic alignment and spatial localization accuracy.
Contribution/Results: The AEPO-trained InfiGUI-G1-3B and InfiGUI-G1-7B models set new state-of-the-art results on multiple challenging GUI grounding benchmarks, with relative improvements of up to 9.0% over the naive RLVR baseline on benchmarks that test generalization and semantic understanding.
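The summary above only names the efficiency principle behind AER. As a minimal sketch of the η = U/C idea, assuming utility U is a binary indicator that any sampled answer hits the target element and cost C is the number of answers spent, a reward in this spirit might look as follows; the function name, signature, and exact scoring here are hypothetical, not the paper's stated formula.

```python
def adaptive_exploration_reward(candidates, target_bbox):
    """Hedged sketch of an efficiency-style reward, eta = U / C.

    candidates:  list of (x, y) click predictions sampled for one
                 instruction (the multi-answer generation step).
    target_bbox: (x0, y0, x1, y1) of the ground-truth UI element.

    Utility U is 1 if any candidate lands inside the target, else 0;
    cost C is the number of answers spent. The paper's actual AER may
    weight these differently; this only illustrates the U / C form.
    """
    x0, y0, x1, y1 = target_bbox
    hit = any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in candidates)
    utility = 1.0 if hit else 0.0
    cost = max(len(candidates), 1)  # guard against an empty answer set
    return utility / cost

# Three sampled answers, one inside the target box: reward = 1/3.
print(adaptive_exploration_reward([(10, 10), (55, 40), (200, 5)],
                                  (50, 30, 80, 60)))
```

In this sketch, dividing a binary hit signal by the answer count is what makes the reward adaptive: sampling many answers raises the chance of a hit (more exploration) but dilutes the reward per answer, nudging the policy toward fewer, more confident answers as training progresses.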
📝 Abstract
The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
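Spelling out the abstract's inline notation: with U the utility extracted from a rollout and C the exploration cost paid to obtain it, the efficiency principle is simply

$$
\eta = \frac{U}{C}.
$$

One hedged concrete reading, consistent with the multi-answer strategy, takes U as a binary indicator that at least one of the k generated answers is correct and C as k itself, so that correct but economical answer sets score highest; the paper's exact instantiation of U and C may differ.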