ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

📅 2026-04-26

🤖 AI Summary
Existing benchmarks struggle to evaluate multimodal coworker agents in multi-day, multi-turn scenarios where the environment changes independently of the agent. To address this gap, this work proposes a sandboxed evaluation framework whose state evolves between turns. It simulates an office setting through five stateful services (filesystem, email, calendar, knowledge base, and spreadsheet) with multimodal evidence, and assesses seven frontier agent systems on 100 multi-day tasks across 13 professional scenarios using 1,537 deterministic rule-based Python checkers, with no LLM-as-judge. Even the strongest system reaches a weighted score of only 75.8, and the best strict Task Success is just 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare.
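
The gap between a high weighted score and a low strict success rate can be illustrated with a small sketch. The summary does not give the benchmark's actual weighting scheme, so the `CheckResult` type, the per-checker weights, and both function names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    weight: float  # hypothetical per-checker weight (not specified in the source)
    passed: bool   # outcome of one deterministic checker


def weighted_score(results):
    """Percentage of total checker weight that passed (assumed formula)."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    return 100.0 * sum(r.weight for r in results if r.passed) / total


def strict_success(results):
    """A task counts as solved only if every one of its checkers passed."""
    return bool(results) and all(r.passed for r in results)


# One illustrative task with three checkers; a single failure sinks strict success.
task = [CheckResult(2.0, True), CheckResult(1.0, True), CheckResult(1.0, False)]
score = weighted_score(task)
solved = strict_success(task)
```

Under these assumptions a task can earn most of its weighted credit while still counting as a strict failure, which is the pattern the reported numbers describe.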


📝 Abstract
Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1,537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches a weighted score of 75.8, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
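
The idea of an exogenous update followed by a deterministic check over post-execution state might be sketched as follows. This is not the released harness: the dictionary layout of the sandbox state, the `apply_exogenous_update` helper, the stub agent, and the example meeting are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import copy


def apply_exogenous_update(state, update):
    """Return a new state with a change the agent did not cause (hypothetical helper)."""
    new_state = copy.deepcopy(state)
    new_state["email"].extend(update.get("email", []))
    new_state["calendar"].update(update.get("calendar", {}))
    return new_state


def check_meeting_rescheduled(state):
    """Deterministic checker: inspects only the final service state, no LLM judge."""
    event = state["calendar"].get("budget-review")
    reply_sent = any(
        m["to"] == "dana@example.com" and "Thursday" in m["body"]
        for m in state["sent"]
    )
    return event == "Thursday 10:00" and reply_sent


def stub_agent_turn(state):
    """Stand-in for an agent that reads the new email and acts on it."""
    new_state = copy.deepcopy(state)
    new_state["calendar"]["budget-review"] = "Thursday 10:00"
    new_state["sent"].append(
        {"to": "dana@example.com", "body": "Moved to Thursday 10:00."}
    )
    return new_state


# Day 1 state: the meeting is still on Wednesday.
state = {"email": [], "sent": [], "calendar": {"budget-review": "Wednesday 10:00"}}

# Between turns, a new email arrives independently of the agent.
state = apply_exogenous_update(state, {
    "email": [{"from": "dana@example.com",
               "body": "Can we move the budget review to Thursday 10:00?"}],
})

before = check_meeting_rescheduled(state)                  # agent has not adapted yet
after = check_meeting_rescheduled(stub_agent_turn(state))  # agent adapted to the update
```

Because the checker reads only service state, the same final state always yields the same verdict, which is the property that rule-based verification relies on.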
Problem

Research questions and friction points this paper is trying to address.

coworker agents
multi-turn
multi-day
stateful environment
multimodal
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-day agent benchmark
stateful sandbox environment
deterministic rule-based evaluation
exogenous environment adaptation
multimodal coworker agents