Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agents operate over only narrow slices of a user's digital world, which limits context-sensitive reasoning and proactive assistance in long-term, multi-device, multi-service settings, and existing benchmarks do not capture this always-on setting. This work introduces Claw-Anything, a benchmark for always-on personal assistants built on a complex, noisy, yet realistic simulated environment that expands agent context along three dimensions: long-horizon user activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. Months of user activity are simulated through multi-round event injection, and an accompanying automated data-generation pipeline produces 2,000 training environments. On this benchmark, GPT-5.5 achieves only 34.5% pass@1, substantially below its results on prior benchmarks, which highlights how demanding the setting is. Training on the generated data improves the base model by 23.7%, indicating that the pipeline is useful as scalable data infrastructure for advancing agent capabilities.

📝 Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility as scalable data infrastructure.
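
For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of how pass@k is commonly computed, using the standard unbiased estimator 1 - C(n-c, k) / C(n, k) over n attempts with c successes per task. The abstract does not describe the paper's exact evaluation protocol, so the function names (`pass_at_k`, `benchmark_pass_at_k`) and the per-task `(n, c)` input format are illustrative assumptions, not the authors' code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts sampled, c of them passed.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset of attempts has no pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks; `results` holds (n, c) per task (assumed format)."""
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With k = 1 the estimator reduces to c / n for each task, so a benchmark-level pass@1 is the mean fraction of attempts that succeed on each task.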

Problem

Research questions and friction points this paper is trying to address.

always-on personal assistants

user's digital world

context-sensitive reasoning

agent benchmarking

proactive assistance

Innovation

Methods, ideas, or system contributions that make the work stand out.

always-on personal assistants

context-rich benchmarking

multi-device GUI/CLI integration