ProjGuard: Safety Monitoring for Computer-Use Agents via Low-Dimensional Projections

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of computer-use agents to prompt injection, indirect instructions, and visual attacks when they operate on real operating systems. Existing defenses typically analyze the prompt, or screen each potentially malicious input with a second large model at inference time, which limits coverage or raises deployment cost. The authors propose ProjGuard, a hierarchical safety framework built on behavioral trajectory monitoring: at each step, a lightweight scalar risk signal is derived from the agent's accumulated interaction history and checked online for drift toward an unsafe region, so that a warning can be raised before the trajectory reaches a potentially harmful action. An auxiliary vision-language model is activated only when an alert is raised, proposing a corrected next step that steers execution back toward task completion. On OS-Harm, monitoring with on-demand correction lowers the unsafe rate from 16% to 3% while raising task completion from 59% to 65%. In a transfer evaluation on RiosWorld, the method reaches a 4% unsafe rate and 64% task completion, which the authors describe as competitive.

📝 Abstract

Computer-use agents are increasingly capable of operating on real operating systems, but this capability has also increased the risks posed by prompt injection, indirect instructions, and visual attacks. Existing defenses typically rely on analyzing the prompt or each potentially malicious input with a second large model at inference time, which can limit coverage or increase deployment cost. We propose ProjGuard, an alternative based on behavioral trajectory monitoring. At each step, we derive a lightweight scalar risk signal from the agent's accumulated interaction history and evaluate, online, whether execution is beginning to drift toward an unsafe region. This enables early warnings before the trajectory reaches a potentially harmful action. When an alert is raised, we selectively activate an auxiliary vision-language model to propose a corrected next step and steer execution back toward task completion. Experiments on OS-Harm show that monitoring with on-demand correction reduces the unsafe rate from 16 percent to 3 percent while improving task completion from 59 percent to 65 percent. We further evaluate transfer to RiosWorld, where the method remains competitive, reaching 4 percent unsafe and 64 percent completion. Overall, these results support a hierarchical safety strategy in which always-on monitoring anticipates deviations and activates correction only when needed.
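The monitor-then-correct loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the risk signal is computed, so everything below is an assumption made for illustration: each step is represented by a hypothetical embedding vector, the scalar risk is a running average of that embedding's projection onto a fixed "risk direction", and the names `TrajectoryMonitor`, `run_episode`, `threshold`, and `decay` are invented here and are not taken from the paper. The `corrector` callback stands in for the auxiliary vision-language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def project(embedding: Vector, direction: Vector) -> float:
    """Scalar projection (dot product) of a step embedding onto a risk direction."""
    return sum(e * d for e, d in zip(embedding, direction))


@dataclass
class TrajectoryMonitor:
    """Hypothetical always-on monitor; not the paper's actual scoring rule."""
    direction: Vector            # assumed fixed direction pointing toward "unsafe"
    threshold: float = 0.6       # assumed alert threshold
    decay: float = 0.5           # weight given to accumulated history
    risk: float = 0.0            # current scalar risk signal
    history: List[float] = field(default_factory=list)

    def update(self, step_embedding: Vector) -> bool:
        """Fold one step into the running risk and report whether to alert."""
        score = project(step_embedding, self.direction)
        self.risk = self.decay * self.risk + (1.0 - self.decay) * score
        self.history.append(self.risk)
        return self.risk > self.threshold


def run_episode(
    steps: Sequence[Tuple[str, Vector]],
    monitor: TrajectoryMonitor,
    corrector: Callable[[str], str],
) -> List[str]:
    """Execute steps, calling the (expensive) corrector only when an alert fires."""
    executed = []
    for action, embedding in steps:
        if monitor.update(embedding):
            action = corrector(action)   # stand-in for the auxiliary VLM
        executed.append(action)
    return executed


if __name__ == "__main__":
    demo_steps = [
        ("open_editor", [0.0, 1.0]),
        ("click_link", [1.0, 0.0]),
        ("run_script", [1.0, 0.0]),
        ("delete_files", [1.0, 0.0]),
    ]
    monitor = TrajectoryMonitor(direction=[1.0, 0.0])
    print(run_episode(demo_steps, monitor, lambda a: f"corrected:{a}"))
    print(monitor.history)
```

In this sketch the per-step cost is a single dot product, and the corrector runs only on flagged steps, which mirrors the hierarchical split between always-on monitoring and on-demand correction. The running risk is not reset after a correction here; whether and how the actual method does so is not stated in the abstract.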

Problem

Research questions and friction points this paper is trying to address.

computer-use agents

safety monitoring

prompt injection

visual attacks

behavioral trajectory

Innovation

Methods, ideas, or system contributions that make the work stand out.

behavioral trajectory monitoring

low-dimensional projection

on-demand correction