AI Summary
Existing checkpointing and recovery mechanisms in agent sandboxes struggle to balance correctness and efficiency: application-level approaches overlook operating system side effects, while full-system, per-turn checkpointing incurs prohibitive overhead. This work proposes a semantic-aware, on-demand checkpointing mechanism that bridges the agent framework and the OS layer for the first time, dynamically identifying whether an interaction induces recovery-relevant state changes. Exploiting the observation that over 75% of turns involve no recovery-relevant state modifications, it triggers lightweight, precise checkpoints only when necessary. The system combines eBPF-based runtime detection, a coordinator that overlaps checkpointing with LLM waiting time, and a multi-sandbox scheduler, all without modifying the agent logic or the underlying checkpoint/restore infrastructure. Experiments show that recovery correctness improves from 8% to 100% on shell-intensive and code-repair tasks, checkpoint traffic is reduced by up to 87%, and execution time stays within 1.9% of fault-free execution.
Abstract
Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback, yet existing approaches fall into two extremes: application-level recovery preserves chat history but misses OS-side effects, while full per-turn checkpointing is correct but too expensive under dense co-location. The root cause is an agent-OS semantic gap: agent frameworks see tool calls but not their OS effects; the OS sees state changes but lacks turn-level context to judge recovery relevance. This gap hides massive sparsity: over 75% of agent turns produce no recovery-relevant state, so most checkpoints are unnecessary. Crab (Checkpoint-and-Restore for Agent SandBoxes) is a transparent host-side runtime that bridges this gap without modifying agents or C/R backends. An eBPF-based inspector classifies each turn's OS-visible effects to decide checkpoint granularity; a coordinator aligns checkpoints with turn boundaries and overlaps C/R with LLM wait time; and a host-scoped engine schedules checkpoint traffic across co-located sandboxes. On shell-intensive and code-repair workloads, Crab raises recovery correctness from 8% (chat-only) to 100%, cuts checkpoint traffic by up to 87%, and stays within 1.9% of fault-free execution time.
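To make the per-turn classification idea concrete, the following is a minimal sketch of how a turn's OS-visible effects might be mapped to a checkpoint granularity. It is not Crab's implementation: the `TurnEffects` schema, the three `Granularity` levels, and the `classify_turn` rules are all hypothetical, and in a real system the effects would come from eBPF probes rather than hand-built records.

```python
from dataclasses import dataclass, field
from enum import Enum


class Granularity(Enum):
    """Hypothetical checkpoint levels; the abstract does not enumerate them."""
    SKIP = "skip"        # nothing recovery-relevant changed this turn
    FILESYSTEM = "fs"    # only on-disk state changed
    FULL = "full"        # process/runtime state changed as well


@dataclass
class TurnEffects:
    """OS-visible effects of one agent turn (assumed schema for illustration)."""
    files_written: set[str] = field(default_factory=set)
    surviving_processes: int = 0  # processes started this turn that outlive it


def classify_turn(effects: TurnEffects) -> Granularity:
    """Pick the cheapest checkpoint that still captures the turn's effects."""
    if effects.surviving_processes > 0:
        return Granularity.FULL
    if effects.files_written:
        return Granularity.FILESYSTEM
    return Granularity.SKIP


# Example trace: most turns only read state, so most checkpoints are skipped.
trace = [
    TurnEffects(),                                # e.g. `ls`, `cat`
    TurnEffects(files_written={"/work/fix.py"}),  # an edit
    TurnEffects(),
    TurnEffects(surviving_processes=1),           # a background server
    TurnEffects(),
]
decisions = [classify_turn(t) for t in trace]
skipped = sum(d is Granularity.SKIP for d in decisions)
```

In this toy trace three of the five turns are skipped; the point of the sketch is only that a cheap per-turn decision lets the runtime avoid checkpoints for turns that change nothing worth recovering.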