🤖 AI Summary
This work addresses a security vulnerability that arises from the mismatch between the limited inspection windows of safety guardrail models and the much larger context windows of large language models (LLMs). Guardrails typically truncate or segment long inputs, creating blind spots that adversaries can exploit by spreading a malicious instruction across segments. The paper introduces the Prompt Overflow Attack, which fragments malicious instructions and interleaves them with benign filler content across an overlong prompt, so that no single inspected segment appears malicious while the downstream LLM can still reconstruct the instruction from the full context. The attack is evaluated against state-of-the-art guardrails, including Meta’s Llama Prompt Guard, IBM’s Granite Guardian, and DeBERTa-based detectors. The experiments show that prompts reliably detected in short-context settings can evade these guardrails once expanded into overlong inputs, while remaining fully actionable by downstream LLMs, exposing a significant flaw in current long-context safety mechanisms.
📝 Abstract
Guardrail models (a.k.a. safety checkers) are widely deployed to screen user inputs before they reach large language models (LLMs), serving as a primary defense against prompt injection attacks. Because their context windows are strictly limited, these models handle overlong prompts through truncation or segmentation-based inspection. While prior work has focused on semantic adversarial inputs, the security implications of these long-input processing mechanisms remain largely unexplored. In this paper, we identify a critical blind spot arising from the mismatch between the limited inspection windows of guardrail models and the substantially larger context windows of downstream LLMs. We introduce a novel Prompt Overflow Attack, which exploits this mismatch by fragmenting malicious instructions and interleaving them with benign filler content across an overlong prompt, such that no individual inspected segment appears malicious while the full context remains actionable to the LLM. Through a systematic evaluation against state-of-the-art guardrail models, including Meta Llama Prompt Guard, IBM Granite Guardian, and DeBERTa-based detectors, we demonstrate that prompts reliably detected in short-context settings can evade guardrail models once adversarially expanded into overlong inputs, yet remain fully actionable by downstream LLMs. We further propose potential defense strategies and outline mitigation directions to strengthen guardrail models.
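
The window mismatch described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes prompts are plain token lists, uses a hypothetical `toy_detector` that flags a span only when two placeholder markers (`MARK_A`, `MARK_B`) appear together in it, and picks an arbitrary inspection window of 8 tokens. None of these names or values come from the paper or from any real guardrail.

```python
from typing import Callable, List

Tokens = List[str]


def truncate(tokens: Tokens, window: int) -> List[Tokens]:
    """Truncation: only the first `window` tokens are ever inspected."""
    return [tokens[:window]]


def segment(tokens: Tokens, window: int) -> List[Tokens]:
    """Segmentation: non-overlapping chunks, each inspected on its own."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def toy_detector(span: Tokens) -> bool:
    """Hypothetical stand-in for a guardrail model (assumption, not a real API).

    It flags a span only if both placeholder markers occur inside it.
    """
    return "MARK_A" in span and "MARK_B" in span


def guardrail_flags(
    tokens: Tokens,
    window: int,
    strategy: Callable[[Tokens, int], List[Tokens]],
) -> bool:
    """The prompt is flagged if any inspected span is flagged."""
    return any(toy_detector(span) for span in strategy(tokens, window))


WINDOW = 8  # assumed guardrail inspection window, in tokens

# Short prompt: both markers fit inside one inspected span.
short_prompt = ["MARK_A", "MARK_B"]

# Overlong prompt: the same markers separated by benign filler,
# so they land in different inspected spans.
long_prompt = ["MARK_A"] + ["filler"] * 10 + ["MARK_B"]

short_flagged = guardrail_flags(short_prompt, WINDOW, truncate)
long_flagged_truncated = guardrail_flags(long_prompt, WINDOW, truncate)
long_flagged_segmented = guardrail_flags(long_prompt, WINDOW, segment)

# A consumer with a larger window (the downstream LLM in the abstract)
# sees both markers at once.
full_context_sees_both = toy_detector(long_prompt)

print(short_flagged)            # True
print(long_flagged_truncated)   # False
print(long_flagged_segmented)   # False
print(full_context_sees_both)   # True
```

In this toy setting, both inspection strategies miss the overlong prompt because neither ever places the two markers in the same inspected span, whereas a reader of the full context sees them together. Real guardrails are learned classifiers rather than marker checks, so this only illustrates the structural gap, not actual detection behavior.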