🤖 AI Summary
Existing LLM-based agents face efficiency bottlenecks in API-free, pixel-level GUI environments: reliant on local visual observations, they make myopic decisions and depend on inefficient trial-and-error, which hinders long-horizon planning and skill transfer. To address this, we propose the State-Action Knowledge Graph (SA-KG), a persistent, topology-aware graph-structured memory that aligns visually distinct but functionally similar states and enables experience reuse. We further design a hybrid intrinsic reward mechanism that decouples strategic policy planning from stochastic exploration, enabling principled action evaluation under delayed rewards. Our method integrates LLM-based reasoning, graph-structured memory, and a hybrid reward scheme grounded in novelty and state-value estimation. Evaluated on *Civilization V* and *Slay the Spire*, our approach substantially improves exploration efficiency, zero-shot generalization, and deep strategic reasoning, outperforming state-of-the-art pixel-level agent methods.
📝 Abstract
Most existing software lacks accessible Application Programming Interfaces (APIs), requiring agents to operate solely through pixel-based Graphical User Interfaces (GUIs). In this API-free setting, large language model (LLM)-based agents face severe efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering both skill acquisition and long-term planning. To address these challenges, we propose KG-Agent, an experience-driven learning framework that structures an agent's raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG). KG-Agent overcomes inefficient exploration by linking functionally similar but visually distinct GUI states, forming a rich neighborhood of experience that enables the agent to generalize from a diverse set of historical strategies. To support long-horizon reasoning, we design a hybrid intrinsic reward mechanism based on the graph topology, combining a state-value reward for exploiting known high-value pathways with a novelty reward that encourages targeted exploration. This approach decouples strategic planning from pure discovery, allowing the agent to effectively value setup actions with delayed gratification. We evaluate KG-Agent in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over state-of-the-art methods.
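To make the core ideas concrete, the sketch below illustrates a minimal state-action graph memory with a hybrid intrinsic reward that sums an exploitation term (value of the known neighborhood) and an exploration term (count-based novelty). All class and method names, the incremental value update, and the `1/sqrt(1+visits)` novelty bonus are illustrative assumptions, not the paper's actual implementation.

```python
from collections import defaultdict
import math

class StateActionKG:
    """Illustrative sketch of a persistent state-action graph memory.

    States are abstract keys (e.g. hashes of GUI features); edges record
    actions taken between states. This is a simplified stand-in for the
    SA-KG described in the abstract, not the authors' implementation.
    """

    def __init__(self):
        self.edges = defaultdict(list)    # state -> [(action, next_state)]
        self.visits = defaultdict(int)    # state -> visit count
        self.value = defaultdict(float)   # state -> estimated value

    def record(self, state, action, next_state, reward, alpha=0.1):
        """Add a transition and update the reached state's value estimate."""
        self.edges[state].append((action, next_state))
        self.visits[next_state] += 1
        # Simple incremental (exponential moving average) value update.
        self.value[next_state] += alpha * (reward - self.value[next_state])

    def neighborhood_value(self, state):
        """Exploitation signal: mean estimated value of known successors."""
        successors = [s for _, s in self.edges[state]]
        if not successors:
            return 0.0
        return sum(self.value[s] for s in successors) / len(successors)

    def novelty(self, state):
        """Exploration signal: count-based novelty bonus for rarely seen states."""
        return 1.0 / math.sqrt(1 + self.visits[state])

    def intrinsic_reward(self, state, beta=0.5):
        """Hybrid intrinsic reward: value exploitation + weighted novelty."""
        return self.neighborhood_value(state) + beta * self.novelty(state)
```

Because the value term is computed over the graph neighborhood rather than a single observed outcome, a "setup" action that leads into a high-value region of the graph can score well even when its immediate reward is zero, which is the delayed-gratification effect the abstract describes.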