ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

📅 2026-04-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks struggle to comprehensively evaluate the capabilities and safety of large language model (LLM) productivity agents in realistic, stateful, multi-service environments. To address this gap, this work introduces a high-fidelity simulated office environment integrating five mock real-world services, including Gmail and Slack, with support for state snapshotting and rollback, along with 44 structured tasks spanning single-service, cross-service, and safety-critical scenarios. The study proposes a decoupled dual-lever scaffolding design that separates domain-specific skill injection from meta-prompt coordination, enabling a joint assessment of LLM agent capability and safety. Experiments show that even with full scaffolding, task success rates reach only 39-64%, while unsafe actions occur in 7-33% of cases. On OpenClaw, the top models achieve 53-63% success but still exhibit unsafe action rates of 7-23%, with no consistent ordering between efficacy and safety. The analysis identifies eight distinct patterns of unsafe behavior.
📝 Abstract
Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.
Problem

Research questions and friction points this paper is trying to address.

LLM agents
productivity tasks
evaluation benchmark
safety
stateful workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents
productivity benchmark
simulated workspaces
safety evaluation
stateful services
Xiangyi Li
BenchFlow
Kyoung Whan Choe
RLWRLD
Yimin Liu
Ohio State University
Xiaokun Chen
Stanford University
Chujun Tao
Carnegie Mellon University
Bingran You
UC Berkeley
Wenbo Chen
Amazon
Zonglin Di
UC Santa Cruz
Jiankai Sun
Stanford University
Artificial Intelligence, Machine Learning, Computer Vision, Robotics
Shenghan Zheng
Dartmouth College
Jiajun Bao
Carnegie Mellon University
Natural Language Processing, Computational Linguistics, Machine Learning
Yuanli Wang
Boston University
Distributed Systems, MLSys, Large Language Models, Agentic AI
Weixiang Yan
Amazon
Code Intelligence, Agentic RL, Software Automation
Yiyuan Li
University of North Carolina at Chapel Hill
Natural Language Processing, Computational Linguistics
Han-chung Lee
Independent