SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

📅 2026-04-24

🤖 AI Summary
This work targets two limitations in training GUI agents for long-horizon tasks: offline reinforcement learning trains on static step-level data and neglects trajectory-level semantics such as task completion and execution quality, while online methods incur high interaction costs. It proposes SOLAR-RL, a semi-online reinforcement learning framework that reconstructs rollout candidates from static datasets, detects the first failure point using per-step validity signals, and retroactively assigns dense step-level rewards with target-aligned shaping, simulating online feedback without additional environment interaction. The framework combines multimodal large language models, trajectory reconstruction, step-wise validity detection, and reward shaping. According to the authors, it significantly improves long-horizon task completion rates and robustness over strong baselines while remaining sample-efficient.

📝 Abstract
As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
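
The abstract's core step (detect the first failure point from per-step validity signals, then retroactively assign dense step-level rewards) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the progress-scaled reward, and the `step_bonus`, `failure_penalty`, and `success_bonus` parameters are all hypothetical, and the simple progress scaling stands in for the paper's target-aligned shaping, whose exact form is not given here.

```python
from typing import List, Sequence


def first_failure_index(valid: Sequence[bool]) -> int:
    """Return the index of the first invalid step, or len(valid) if no step fails."""
    for i, ok in enumerate(valid):
        if not ok:
            return i
    return len(valid)


def assign_step_rewards(
    valid: Sequence[bool],
    step_bonus: float = 1.0,        # hypothetical: base reward for a valid pre-failure step
    failure_penalty: float = -1.0,  # hypothetical: reward at the first failing step
    success_bonus: float = 1.0,     # hypothetical: extra reward on the last step of a clean trajectory
) -> List[float]:
    """Retroactively assign dense per-step rewards from per-step validity flags.

    Assumed scheme (illustrative only):
      * steps before the first failure get step_bonus scaled by the fraction
        of the trajectory completed before that failure;
      * the first failing step gets failure_penalty;
      * steps after the first failure get 0.0;
      * a trajectory with no failure gets success_bonus added to its last step.
    """
    n = len(valid)
    if n == 0:
        return []
    k = first_failure_index(valid)
    progress = k / n  # trajectory-level quality signal in [0, 1]
    rewards = []
    for i in range(n):
        if i < k:
            rewards.append(step_bonus * progress)
        elif i == k:
            rewards.append(failure_penalty)
        else:
            rewards.append(0.0)
    if k == n:
        rewards[-1] += success_bonus
    return rewards


if __name__ == "__main__":
    # A four-step rollout whose third step is invalid.
    print(assign_step_rewards([True, True, False, True]))
    # A three-step rollout with no invalid step.
    print(assign_step_rewards([True, True, True]))
```

Under these assumptions, every step before the failure inherits a reward that depends on how far the whole trajectory progressed, which is one way a trajectory-level signal can be folded into step-level offline data without new environment interaction.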
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Multimodal Large Language Models
GUI Agents
Offline RL
Long-horizon Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-Online RL
Long-horizon Tasks
Trajectory-level Reward Shaping
GUI Agents
Multimodal LLMs