ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

📅 2026-04-26

🤖 AI Summary

Existing benchmarks rarely evaluate agents that work alongside a user across multiple days and turns while the environment changes on its own. This work introduces ClawMark, a sandboxed evaluation framework whose service state evolves between turns. It simulates an office setting built on five stateful services (filesystem, email, calendar, knowledge base, and spreadsheet), with evidence spread across images, scanned PDFs, audio, video, and spreadsheets, and it scores 100 multi-day tasks with 1,537 deterministic rule-based checkers. Even the strongest of the seven agent systems evaluated reaches a weighted score of only 75.8, and the best strict Task Success rate is just 20.0%. Partial progress is therefore common, but reliable end-to-end workflow completion over extended periods remains rare.

📝 Abstract

Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn, multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1,537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches a weighted score of 75.8, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
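To make the gap between a high weighted score and a low strict Task Success concrete, here is a minimal Python sketch of rule-based scoring over post-execution service state. It is not taken from the ClawMark harness. The `Checker` class, the `score_task` function, the state layout, the checker names, and the weights are all hypothetical. The sketch also assumes that the weighted score is the weight-normalized share of passed checkers and that strict Task Success requires every checker of a task to pass; the abstract does not spell out either definition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of the sandboxed services after the agent has finished.
ServiceState = Dict[str, dict]


@dataclass
class Checker:
    """One deterministic rule evaluated against the final service state."""
    name: str
    weight: float
    check: Callable[[ServiceState], bool]


def score_task(state: ServiceState, checkers: List[Checker]) -> dict:
    """Run every checker and aggregate (assumed definitions, see lead-in)."""
    results = {c.name: bool(c.check(state)) for c in checkers}
    total = sum(c.weight for c in checkers)
    passed = sum(c.weight for c in checkers if results[c.name])
    return {
        # Assumption: weighted score = passed weight / total weight, in percent.
        "weighted_score": 100.0 * passed / total if total else 0.0,
        # Assumption: strict success means no checker failed.
        "task_success": bool(checkers) and all(results.values()),
        "results": results,
    }


# Illustrative final state: the agent sent the email and updated the budget,
# but left the meeting on day 2 after it had been moved to day 3.
state: ServiceState = {
    "email": {"sent": [{"to": "finance@example.com", "subject": "Q3 invoice"}]},
    "calendar": {"events": [{"title": "Vendor sync", "day": 2}]},
    "spreadsheet": {"budget.xlsx": {"B2": 1200}},
}

checkers = [
    Checker(
        "invoice_email_sent",
        2.0,
        lambda s: any(m["to"] == "finance@example.com" for m in s["email"]["sent"]),
    ),
    Checker(
        "meeting_moved_to_day_3",
        1.0,
        lambda s: any(
            e["title"] == "Vendor sync" and e["day"] == 3
            for e in s["calendar"]["events"]
        ),
    ),
    Checker(
        "budget_cell_updated",
        1.0,
        lambda s: s["spreadsheet"]["budget.xlsx"].get("B2") == 1200,
    ),
]

report = score_task(state, checkers)
print(report["weighted_score"], report["task_success"])
```

In this toy case the agent passes two of three checkers, so it earns most of the weighted credit while still failing the task outright. Under the assumed definitions, this is the same pattern the abstract reports at benchmark scale: partial progress is rewarded by the weighted score, but a single missed state change is enough to fail strict Task Success.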

Problem

Research questions and friction points this paper is trying to address.

coworker agents

multi-turn

multi-day

stateful environment

multimodal

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-day agent benchmark

stateful sandbox environment

deterministic rule-based evaluation