GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited task-completion performance of existing GUI agents, which the authors attribute to a lack of explicit world knowledge about GUI operations and to post-training methods (SFT and RL) that acquire such knowledge only implicitly. The authors propose GUI-CIDER, a mid-training framework that explicitly internalizes GUI world knowledge. It works in three stages: data synthesis distills static planning knowledge and dynamic causal knowledge from GUI trajectories into text; density-aware exemplar reselection filters the resulting corpus by rewarding causal structures and penalizing semantic redundancy; and mid-training embeds the refined knowledge in the model. Across two GUI knowledge benchmarks and three task-completion benchmarks, the method consistently improves both the agent's understanding of GUI operations and its task success rates.
📝 Abstract
Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.
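To make stage (2) concrete, the following is a minimal sketch of one way a filter could reward causal structure while penalizing semantic redundancy. It is not the authors' implementation: the cue-word list, the token-overlap (Jaccard) similarity used as a stand-in for semantic density, the greedy selection loop, and the names `causal_score`, `reselect`, and `redundancy_weight` are all assumptions made for illustration.

```python
import re

# Hypothetical causal cue phrases; the paper's actual causal reward is not given here.
CAUSAL_CUES = ("because", "therefore", "so that", "leads to", "causes", "results in")


def tokens(text):
    """Lowercased word set, used as a crude proxy for semantic content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def causal_score(text):
    """Count occurrences of causal cue phrases (illustrative reward only)."""
    lowered = text.lower()
    return sum(lowered.count(cue) for cue in CAUSAL_CUES)


def jaccard(a, b):
    """Token-set overlap in [0, 1]; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reselect(corpus, k, redundancy_weight=2.0):
    """Greedily keep up to k texts, trading causal reward against redundancy.

    A candidate's redundancy is its highest similarity to any text already
    kept, an assumed stand-in for a density estimate.
    """
    token_sets = [tokens(t) for t in corpus]
    remaining = list(range(len(corpus)))
    kept = []

    def gain(i):
        density = max((jaccard(token_sets[i], token_sets[j]) for j in kept), default=0.0)
        return causal_score(corpus[i]) - redundancy_weight * density

    while remaining and len(kept) < k:
        best = max(remaining, key=gain)  # ties resolve to the earliest candidate
        kept.append(best)
        remaining.remove(best)
    return [corpus[i] for i in kept]
```

In this sketch a near-duplicate of an already kept text receives a large penalty, so a less similar text that still contains causal language is preferred. A real pipeline would likely replace the token overlap with embedding-based similarity and the cue count with a learned or model-judged causal reward.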
Problem

Research questions and friction points this paper is trying to address.

GUI agents
world knowledge
mid-training
causal internalization
task completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Internalization
Density-aware Exemplar Reselection
Mid-training
GUI Agents
World Knowledge Internalization