🤖 AI Summary
Current evaluations of AI agents rely predominantly on synthetic environments and short-duration tasks, which fail to capture whether agents can complete complex, long-horizon work in real-world settings. To address this gap, this work introduces WildClawBench, a long-horizon agent benchmark grounded in native command-line interface (CLI) runtimes. It comprises 60 human-authored, bilingual, multimodal tasks, each averaging roughly eight minutes of wall-clock time and more than 20 calls to real tools. Tasks run in Docker containers hosting mainstream agent harnesses (OpenClaw, Claude Code, Codex, or Hermes Agent), and grading combines deterministic rule-based checks, environment-state auditing, and LLM/VLM-based semantic judgment to form a reproducible end-to-end evaluation pipeline. Across 19 frontier models, even the best-performing one, Claude Opus 4.7 under OpenClaw, achieves only 62.2% overall, and every other model falls below 60%. Switching the harness alone shifts a single model's score by up to 18 points, underscoring the substantial limitations of current agents in authentic long-horizon scenarios.
📝 Abstract
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching the harness alone shifts a single model's score by up to 18 points. These results show that long-horizon, native-runtime agent tasks remain far from solved for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
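To make the hybrid grading idea concrete, the following is a minimal Python sketch of how the three signals named in the abstract (rule-based checks, environment-state auditing, and a semantic judge) might be combined for one task. It is not the WildClawBench implementation: the function `grade_task`, the `stub_judge` stand-in, the 0.5 threshold, and the "all three must pass" aggregation are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases; none of these names come from the benchmark itself.
RuleCheck = Callable[[str], bool]              # deterministic check on the agent's final output
StateCheck = Callable[[Dict[str, str]], bool]  # audit of side effects (here: path -> file contents)
Judge = Callable[[str], float]                 # semantic score in [0, 1]


def grade_task(
    final_output: str,
    env_state: Dict[str, str],
    rule_checks: List[RuleCheck],
    state_checks: List[StateCheck],
    judge: Judge,
    judge_threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Dict[str, bool]:
    """Combine the three grading signals; assumes a task passes only if all agree."""
    rules_ok = all(check(final_output) for check in rule_checks)
    state_ok = all(check(env_state) for check in state_checks)
    judge_ok = judge(final_output) >= judge_threshold
    return {
        "rules": rules_ok,
        "state": state_ok,
        "judge": judge_ok,
        "passed": rules_ok and state_ok and judge_ok,
    }


def stub_judge(text: str) -> float:
    # Stand-in for an LLM/VLM judge; a real judge would query a model.
    return 1.0 if "summary" in text.lower() else 0.0


result = grade_task(
    final_output="Wrote summary to report.md",
    env_state={"report.md": "# Summary\n..."},
    rule_checks=[lambda out: "report.md" in out],
    state_checks=[lambda state: "report.md" in state],
    judge=stub_judge,
)
print(result)
```

Keeping the three verdicts separate in the returned dictionary makes it possible to see whether a failure came from the output format, a missing side effect, or the semantic judgment, which is one plausible motivation for a hybrid scheme over a single final-answer check.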