ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of existing GUI agents, which either rely on coordinate-based localization or suffer from data scarcity in coordinate-free approaches. The authors propose ToolTok, a framework that models GUI interaction as a multi-step tool-use sequence. By introducing learnable tool-token embeddings aligned with human interaction patterns and a semantic anchoring mechanism, ToolTok achieves strong generalization with minimal data. Combined with a curriculum learning strategy (progressing from tool-definition question answering and textual tool selection to simplified visual path planning) and fine-tuning of a large language model, ToolTok outperforms same-scale (4B) baselines on multiple benchmarks using less than 1% of the training data, remains competitive with a 235B-parameter model, and demonstrates strong generalization to unseen interfaces.

📝 Abstract
Existing GUI agent models relying on coordinate-based one-step visual grounding struggle to generalize to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address these limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as a natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training and inference code is open-source at https://github.com/ZephinueCode/ToolTok.
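The abstract's semantic anchoring idea, grounding each new tool token in semantically related vocabulary concepts as an inductive bias, can be illustrated with a minimal sketch. Here the tool names, anchor word sets, and the mean-of-anchors initialization are illustrative assumptions, not the paper's actual design; a toy random table stands in for a pre-trained LLM's input embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pre-trained model's input embedding table.
vocab = ["press", "tap", "move", "cursor", "scroll", "wheel", "up", "down"]
embed_dim = 8
embedding_table = {w: rng.normal(size=embed_dim) for w in vocab}

# Hypothetical semantic anchors: each new tool token is tied to
# related vocabulary words that describe the interaction.
tool_anchors = {
    "<CLICK>": ["press", "tap"],
    "<MOVE>": ["move", "cursor"],
    "<SCROLL>": ["scroll", "wheel", "up", "down"],
}

def anchored_init(anchor_words, table):
    """Initialize a learnable tool-token embedding as the mean of its
    anchor-word embeddings, so training starts from a semantically
    meaningful point instead of random noise."""
    return np.mean([table[w] for w in anchor_words], axis=0)

tool_embeddings = {tok: anchored_init(words, embedding_table)
                   for tok, words in tool_anchors.items()}
```

In a real setup these vectors would be appended to the model's embedding matrix and fine-tuned; the anchoring only provides the starting point, which is what lets the embeddings be learned from very little supervision.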
Problem

Research questions and friction points this paper is trying to address.

GUI agents
generalization
data scarcity
input resolution
visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool tokenization
multi-step pathfinding
semantic anchoring
curriculum learning
GUI agents
Xiaoce Wang
Department of Computer Science, Tsinghua University, Beijing, China

Guibin Zhang
National University of Singapore
Multi-Agent System · Efficient AI

Junzhe Li
Peking University, Beijing, China

Jinzhe Tu
Department of Computer Science, Tsinghua University, Beijing, China

Chun Li
MD Anderson Cancer Center
diagnostic imaging · drug delivery · nanotechnology

Ming Li
Senior Research Scientist, Guangming Lab
AIGC · MLLMs · Embodied AI