🤖 AI Summary
Existing snapshot-and-restore mechanisms (e.g., CRIU, container commit) struggle to support exploratory execution in LLM agents, such as branching, backtracking, and multi-path search, especially when external resources (files, sockets, cloud APIs) are shared; they suffer from high latency, poor stability, and heavy overhead. This paper systematically identifies three core challenges: (1) branch semantics (controlling the visibility of cross-branch state updates), (2) external side effects (service-aware interception of API calls), and (3) native forking (microsecond-scale cloning of databases and runtimes). To address them, we propose a fork-semantics model tailored to agent exploration, design a lightweight service-interception framework, and implement a native fork mechanism that avoids full-state copying. Experiments show that off-the-shelf tools collapse under real-world workloads, while our approach reduces branch latency to the microsecond level, enables high-fidelity shared-state exploration, and establishes a foundational runtime substrate for LLM agent systems.
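The service-interception idea can be made concrete with a minimal sketch (all names here are hypothetical, not the paper's implementation): a proxy buffers side-effecting calls made inside a speculative branch and only replays them against the real service if the branch commits; aborting the branch simply discards the buffer, so the outside world never observes an abandoned path.

```python
class ForkAwareProxy:
    """Intercepts calls to an external service from a speculative branch.

    Side-effecting calls are buffered instead of executed; committing the
    branch replays them against the real service, aborting discards them.
    (Illustrative sketch only; class and method names are hypothetical.)
    """

    def __init__(self, real_call):
        self._real_call = real_call   # the actual side-effecting function
        self._buffered = []           # deferred (args, kwargs) pairs

    def __call__(self, *args, **kwargs):
        # Intercept: record the call, produce no external effect yet.
        self._buffered.append((args, kwargs))

    def commit(self):
        # Branch succeeded: replay the buffered calls for real.
        results = [self._real_call(*a, **kw) for a, kw in self._buffered]
        self._buffered.clear()
        return results

    def abort(self):
        # Branch abandoned: the outside world never sees its calls.
        self._buffered.clear()


# Example: an agent "sends an email" inside a tentative branch.
sent = []
send_email = ForkAwareProxy(sent.append)
send_email("draft from branch A")      # intercepted, nothing sent yet
assert sent == []
send_email.commit()                    # branch kept: effect happens now
assert sent == ["draft from branch A"]
```

Buffer-and-replay only works for calls whose effects can safely be deferred; the paper's point is that many real services need deeper fork awareness than a client-side wrapper can provide.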
📝 Abstract
Agentic exploration, which lets LLM-powered agents branch, backtrack, and search across many execution paths, demands systems support well beyond today's pass-at-k resets. Our benchmark of six snapshot/restore mechanisms shows that generic tools such as CRIU or container commit are not fast enough even in isolated testbeds, and they crumble entirely in real deployments where agents share files, sockets, and cloud APIs with other agents and human users. In this talk, we pinpoint three fundamental open challenges: fork semantics, which concerns how branches reveal or hide tentative updates; external side effects, where services must be made fork-aware or their calls intercepted; and native forking, which requires cloning databases and runtimes in microseconds without bulk copying.
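The fork-semantics and native-forking challenges can be illustrated together with a copy-on-write sketch (hypothetical names, not the system described in the talk): forking a branch is O(1) because the child merely chains to its parent's snapshot instead of copying it, tentative writes stay invisible to sibling branches, and committing merges them back into the shared state.

```python
class BranchState:
    """Copy-on-write agent state for branch/backtrack exploration.

    Forking is O(1): a child keeps a reference to the parent snapshot
    instead of bulk-copying it. Writes stay local to a branch, invisible
    to siblings, until commit() merges them into the parent.
    (Illustrative sketch; all names are hypothetical.)
    """

    def __init__(self, parent=None):
        self._parent = parent   # shared snapshot, read through on misses
        self._local = {}        # this branch's tentative updates

    def get(self, key, default=None):
        if key in self._local:
            return self._local[key]
        if self._parent is not None:
            return self._parent.get(key, default)
        return default

    def set(self, key, value):
        self._local[key] = value

    def fork(self):
        # "Native fork": no bulk copy, just chain to the current snapshot.
        return BranchState(parent=self)

    def commit(self):
        # Make this branch's tentative updates visible in the parent.
        self._parent._local.update(self._local)


root = BranchState()
root.set("plan.txt", "v1")
a, b = root.fork(), root.fork()
a.set("plan.txt", "v2-from-a")             # tentative, branch-local write
assert b.get("plan.txt") == "v1"           # siblings do not see it
a.commit()
assert root.get("plan.txt") == "v2-from-a"
# Because b reads through to the shared parent, a's commit is now visible
# to the still-live sibling -- exactly the visibility question that the
# fork-semantics challenge asks a real system to answer deliberately.
assert b.get("plan.txt") == "v2-from-a"
```

A dictionary makes the semantics easy to see; the hard part the talk targets is getting the same O(1), no-bulk-copy behavior out of real databases and language runtimes.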