Human-Guided Harm Recovery for Computer Use Agents

📅 2026-04-20
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of safely recovering from harmful actions executed by language model (LM) agents on real computer systems. It formalizes the task of “harm recovery” and proposes a recovery strategy that accounts for human preferences and how they shift with context. Through a user study, the authors derive human-aligned recovery criteria and train a reward model to rerank multiple recovery plans generated by the agent. They also introduce BackBench, a benchmark of 50 computer-use tasks that test recovery from harmful states. In human evaluation, the proposed method produces higher-quality recovery trajectories than both base agents and rubric-based scaffolds.

📝 Abstract
As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
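The test-time re-ranking step described in the abstract can be illustrated with a minimal best-of-N sketch. This is not the authors' implementation: `RecoveryPlan`, `rerank_plans`, and `toy_score` are hypothetical names introduced here, and the scoring function is a placeholder standing in for the learned reward model, whose inputs and architecture the abstract does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RecoveryPlan:
    """A candidate recovery plan proposed by an agent (hypothetical structure)."""
    description: str
    steps: tuple[str, ...]


def rerank_plans(
    plans: Sequence[RecoveryPlan],
    score_fn: Callable[[RecoveryPlan], float],
) -> list[RecoveryPlan]:
    """Return the candidates sorted from highest to lowest score.

    score_fn stands in for a reward model; any callable that maps a
    plan to a float can be plugged in.
    """
    if not plans:
        raise ValueError("at least one candidate plan is required")
    return sorted(plans, key=score_fn, reverse=True)


def toy_score(plan: RecoveryPlan) -> float:
    """Placeholder scorer (an assumption, not a trained model).

    It simply prefers plans with fewer steps, as a crude proxy for a
    preference toward targeted recovery strategies.
    """
    return -float(len(plan.steps))


if __name__ == "__main__":
    candidates = [
        RecoveryPlan(
            description="Audit everything, then restore",
            steps=("list all files", "diff against backup", "restore file"),
        ),
        RecoveryPlan(
            description="Restore the deleted file from trash",
            steps=("restore file",),
        ),
    ]
    best = rerank_plans(candidates, toy_score)[0]
    print(best.description)
```

In this sketch the agent would execute only the top-ranked plan. Replacing `toy_score` with a model trained on pairwise preference judgments would leave the selection logic unchanged.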
Problem

Research questions and friction points this paper is trying to address.

harm recovery
human preferences
agent safety
post-execution safeguards
computer-use agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

harm recovery
preference alignment
reward modeling
agent safety
BackBench