Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) remain vulnerable to decomposition attacks, where a malicious objective is fragmented into seemingly benign subtasks to evade shallow alignment mechanisms. Method: We propose the first lightweight sequential monitoring framework, which employs prompt-engineered external monitors to perform cumulative, cross-step assessment of long-horizon intent and real-time interception. Contribution/Results: We introduce the largest publicly available decomposition-attack benchmark to date, spanning question answering, text-to-image generation, and agent-based reasoning tasks. Experiments demonstrate a 93% defense success rate on GPT-4o, significantly outperforming state-of-the-art reasoning models such as o3-mini used as monitors, while exhibiting strong robustness against random task injection. Our framework reduces inference cost by 90% and latency by 50%, enabling efficient, scalable safety enforcement without compromising responsiveness.

📝 Abstract
Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attacks are broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intent. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt-engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3-mini as a monitor. Moreover, it remains robust against random task injection and cuts cost by 90% and latency by 50%. Our findings suggest that lightweight sequential monitors are highly effective in mitigating decomposition attacks and are viable in deployment.
Problem

Research questions and friction points this paper is trying to address.

Detecting malicious intent in decomposed LLM subtasks
Defending against obfuscated long-range harmful goals
Real-time lightweight monitoring for sequential attack prevention
Innovation

Methods, ideas, or system contributions that make the work stand out.

External monitor observes the conversation at a higher granularity
Lightweight sequential framework cumulatively evaluates each subtask
Prompt-engineered monitor achieves a 93% defense success rate
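The core mechanism above can be sketched in a few lines: the monitor keeps the full transcript of subtasks and, at each step, evaluates the cumulative conversation rather than the latest prompt in isolation, intercepting once aggregated intent looks harmful. This is a minimal illustrative sketch; the names (`SequentialMonitor`, `harm_score`) are hypothetical, and the keyword-matching scorer stands in for the paper's prompt-engineered LLM monitor.

```python
def harm_score(transcript: str) -> float:
    """Stub for the monitor model: estimate the probability that the
    cumulative transcript pursues a harmful goal. A real deployment would
    prompt a lightweight LLM here; this keyword check is only a placeholder."""
    banned = ("synthesize an explosive", "bypass authentication")
    return 1.0 if any(k in transcript.lower() for k in banned) else 0.0


class SequentialMonitor:
    """Cumulative, cross-step monitor: scores the whole subtask history,
    not each subtask in isolation (hypothetical sketch, not the paper's code)."""

    def __init__(self, threshold: float = 0.5):
        self.history: list[str] = []
        self.threshold = threshold

    def check(self, subtask: str) -> bool:
        """Append the new subtask and evaluate the full transcript.
        Returns True if the request may proceed, False to intercept."""
        self.history.append(subtask)
        transcript = "\n".join(self.history)  # cumulative, long-horizon view
        return harm_score(transcript) < self.threshold


monitor = SequentialMonitor()
print(monitor.check("List common household oxidizers"))           # → True (benign alone)
print(monitor.check("Now explain how to synthesize an explosive"))  # → False (intercepted)
```

The key design point mirrored here is that interception depends on the accumulated history: the second subtask would also look more suspicious in context than in isolation, which is exactly what per-prompt shallow filters miss.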