ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

πŸ“… 2026-01-15
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
This work addresses the vulnerability of large language model (LLM) agents to indirect prompt injection attacks, which can maliciously hijack their behavior. The authors propose a novel defense framework that uniquely integrates structured reasoning with test-time trajectory selection. Specifically, structured reasoning is employed to parse user queries and detect conflicting instructions, thereby preserving task coherence, while a preference-optimized critic model leverages test-time scaling to dynamically select the optimal reasoning path. Evaluated on the CyberSecEval2 benchmark, the approach reduces attack success rates to 3.6% while maintaining 94.6% task utility, significantly outperforming state-of-the-art defenses such as Meta’s SecAlign. This demonstrates a strong balance between security and practicality in real-world LLM agent deployment.

πŸ“ Abstract
Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state-of-the-art defensive model Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results can be found at https://github.com/leolee99/ReasAlign.
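The test-time scaling step described above (sample several reasoning trajectories, score each with a preference-optimized judge, keep the highest-scoring one) can be sketched as best-of-N selection. This is a minimal illustration, not the paper's implementation: `generate_trajectory` and `judge_score` are hypothetical stand-ins for the agent's LLM sampler and the judge model.

```python
# Hedged sketch of judge-guided best-of-N trajectory selection, as in
# ReasAlign's test-time scaling. All function bodies are placeholder
# assumptions, not the authors' code.

def generate_trajectory(query: str, seed: int) -> str:
    """Stand-in for sampling one structured-reasoning trace from the LLM."""
    return (f"[trace {seed}] parse query; detect conflicting instructions; "
            f"answer: {query}")

def judge_score(trajectory: str) -> float:
    """Stand-in for the preference-optimized judge model.

    A real judge would output a learned preference score; here we simply
    reward traces that include the conflict-detection step.
    """
    return 1.0 if "detect conflicting instructions" in trajectory else 0.0

def select_best_trajectory(query: str, n: int = 4) -> str:
    """Sample n candidate reasoning traces and keep the judge's favorite."""
    candidates = [generate_trajectory(query, seed) for seed in range(n)]
    return max(candidates, key=judge_score)

best = select_best_trajectory("summarize this email thread")
print(best)
```

The design point is that safety checking happens at selection time: even if some sampled trajectories follow an injected instruction, the judge can prefer one that stays on the user's original task.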
Problem

Research questions and friction points this paper is trying to address.

prompt injection attack
safety alignment
large language models
agentic systems
adversarial robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt injection defense
reasoning-enhanced alignment
test-time scaling
safety alignment
agentic systems