🤖 AI Summary
This work presents cotomi Act, a browser-based computer-using agent that executes complex, multi-step tasks and passively observes the user's browsing, progressively abstracting that behavior into reusable organizational knowledge (task boards and a wiki) in a shared workspace that both user and agent can edit. Key methodological components include adaptive lazy observation, verbal-diff-based history compression, a coarse-grained action space, test-time scaling via best-of-N action selection, and a behavior-to-knowledge pipeline. On the 179-task WebArena human-evaluation subset, the system achieves an 80.4% task success rate, exceeding the reported human baseline of 78.2%. A controlled proxy evaluation further indicates that task success improves as behavior-derived knowledge accumulates.
📝 Abstract
What if a browser agent could learn your work simply by watching you do it? We present cotomi Act, a browser-based computer-using agent that combines reliable multi-step task execution with persistent organizational knowledge learned from user behavior. For execution, an agent scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling via best-of-N action selection achieves 80.4% on the 179-task WebArena human-evaluation subset, exceeding the reported 78.2% human baseline. For organizational knowledge, a behavior-to-knowledge pipeline passively observes the user's browsing and progressively abstracts it into artifacts (task boards, wiki) exposed through a shared workspace editable by both user and agent. A controlled proxy evaluation confirms that task success improves as behavior-derived knowledge accumulates. In our live demonstration, attendees interact with the system in a real browser, issuing tasks and observing end-to-end autonomous execution and shared knowledge management.
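As a rough illustration of the best-of-N action selection mentioned in the abstract, the sketch below samples several candidate actions for one observation and keeps the highest-scoring one. It is a generic rendering of the technique under stated assumptions, not the cotomi Act implementation: the `propose` and `score` callables, the string encoding of observations and actions, and the default `n` are all hypothetical, and the abstract does not say how candidates are generated or ranked.

```python
from typing import Callable


def best_of_n(
    propose: Callable[[str], str],
    score: Callable[[str, str], float],
    observation: str,
    n: int = 4,
) -> str:
    """Sample n candidate actions and return the highest-scoring one.

    `propose` stands in for a stochastic policy (for example, an LLM call)
    and `score` for any ranking function; both are hypothetical placeholders.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates for the same observation.
    candidates = [propose(observation) for _ in range(n)]
    # max() returns the first maximal element, so ties go to the earliest sample.
    return max(candidates, key=lambda action: score(observation, action))
```

Larger values of `n` spend more compute per step in exchange for a better chance that at least one sampled action is a good one, which is the usual trade-off behind test-time scaling.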