🤖 AI Summary
Current evaluations of AI agents predominantly focus on isolated capabilities and fail to capture integrated performance on tasks that combine vision, web search, and coding. To address this gap, this work proposes CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks. Each task is specified only by a natural-language instruction and an automatic evaluation function over the final output, which enables scalable evaluation that does not depend on a particular agent infrastructure. The accompanying lightweight CocoaAgent scaffold supports controlled comparisons across model backbones. Notably, the best evaluated system achieves only a 45.1% success rate on CocoaBench, and the analysis points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
📝 Abstract
LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, and recent agent scaffolds and models increasingly integrate these capabilities into unified systems. Yet most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only a 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
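The task format described above (an instruction plus an automatic evaluation function over the final output) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Task` record, the `success_rate` helper, the toy tasks, and the toy agent are hypothetical and are not taken from CocoaBench or CocoaAgent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Task:
    # Hypothetical record, not the benchmark's actual schema.
    instruction: str                  # natural-language instruction given to the agent
    evaluate: Callable[[str], bool]   # automatic check applied to the final output only


def success_rate(tasks: Iterable[Task], agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes its evaluation function."""
    task_list = list(tasks)
    if not task_list:
        raise ValueError("at least one task is required")
    passed = sum(1 for task in task_list if task.evaluate(agent(task.instruction)))
    return passed / len(task_list)


# Toy tasks and a toy agent, invented purely for illustration.
toy_tasks = [
    Task("Reply with the sum of 2 and 3.", lambda out: out.strip() == "5"),
    Task("Reply with the capital of France.", lambda out: out.strip().lower() == "paris"),
]


def toy_agent(instruction: str) -> str:
    # Stand-in for a real agent; it only knows how to answer the first toy task.
    return "5" if "sum" in instruction else "unknown"


rate = success_rate(toy_tasks, toy_agent)
```

Because the checker sees only the final output, the same tasks can in principle be run against any agent that maps an instruction to an answer, regardless of which tools or scaffold it uses internally.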