🤖 AI Summary
Autonomous agents driven by large language models (LLMs) are not robust enough in anomalous real-world scenarios and therefore require human-in-the-loop (HITL) collaboration. To address this, the paper proposes a sandboxed execution framework that supports hybrid human-machine takeover. It introduces: (1) an Adaptive Streaming Protocol (ASP) that dynamically blends command-based and video-based streaming to enable low-latency, reliable task handover over heterogeneous networks; (2) a cross-platform virtualized sandbox environment integrating hybrid command-video transmission, programmable APIs, and manual control interfaces; and (3) lightweight AI integration via the Model Context Protocol (MCP) standard and open-source SDKs. Experimental evaluation shows a 48.3% improvement in task success rate, up to 50% lower bandwidth consumption than standard RDP, and a 5.1% decrease in end-to-end latency. System stability and user experience also improve under weak-network conditions.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI agents capable of executing complex, multi-step tasks. However, these agents remain brittle when faced with real-world exceptions, making Human-in-the-Loop (HITL) supervision essential for mission-critical applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux, Android, web browsers, and code interpreters. Its core contribution is a unified session accessible via a hybrid control interface: an AI agent can interact programmatically via mainstream interfaces (MCP, open-source SDK), while a human operator can, at any moment, seamlessly take over full manual control. This seamless intervention is enabled by the Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP, ASP is specifically engineered for this hybrid use case, delivering an ultra-low-latency, smoother user experience that remains resilient even in weak network environments. It achieves this by dynamically blending command-based and video-based streaming, adapting its encoding strategy based on network conditions and the current controller (AI or human). Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark of complex tasks, the AgentBay (Agent + Human) model improved the task success rate by more than 48%. Furthermore, ASP reduces bandwidth consumption by up to 50% compared to standard RDP and reduces end-to-end latency by around 5%, especially under poor network conditions. We posit that AgentBay provides a foundational primitive for building the next generation of reliable, human-supervised autonomous systems.
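To make the adaptive behavior described in the abstract concrete, the following is a minimal, illustrative sketch of how a streaming mode might be chosen from the current controller and the network conditions. It is not AgentBay's or ASP's actual logic: the names `Controller`, `StreamMode`, `NetworkStats`, and `choose_stream_mode`, the decision policy, and the threshold values are all hypothetical assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Controller(Enum):
    """Who currently drives the session (hypothetical names)."""
    AI = "ai"
    HUMAN = "human"


class StreamMode(Enum):
    """Hypothetical streaming modes, not taken from the ASP specification."""
    COMMAND = "command"  # compact command/instruction stream only
    VIDEO = "video"      # encoded video frames
    HYBRID = "hybrid"    # commands plus reduced-rate video


@dataclass
class NetworkStats:
    bandwidth_kbps: float  # estimated available bandwidth
    rtt_ms: float          # round-trip time (unused by this simple policy)
    loss_rate: float       # packet loss as a fraction in [0, 1]


def choose_stream_mode(
    controller: Controller,
    net: NetworkStats,
    weak_bandwidth_kbps: float = 1000.0,  # assumed threshold
    weak_loss_rate: float = 0.05,         # assumed threshold
) -> StreamMode:
    """Pick a mode from the controller and network state (assumed policy)."""
    weak_network = (
        net.bandwidth_kbps < weak_bandwidth_kbps
        or net.loss_rate > weak_loss_rate
    )
    # Assumption: an AI controller does not need a smooth visual stream,
    # so the compact command stream is enough.
    if controller is Controller.AI:
        return StreamMode.COMMAND
    # Assumption: a human on a weak network gets commands plus reduced video.
    if weak_network:
        return StreamMode.HYBRID
    # Assumption: a human on a healthy network gets full video.
    return StreamMode.VIDEO
```

In this sketch, a handover from the agent to a human operator would simply call `choose_stream_mode` again with `Controller.HUMAN`, so the mode changes within the same session rather than requiring a new connection.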