🤖 AI Summary
This work addresses the vulnerability of large language model (LLM)-integrated applications to prompt injection attacks, where malicious instructions—often injected through multiple sources—exhibit ambiguous semantic boundaries with the legitimate context, rendering them difficult to detect. To counter this threat, the authors propose InstruCoT, a novel approach that integrates instruction-level chain-of-thought learning with diverse adversarial data synthesis. Through fine-tuning, InstruCoT enhances a model's ability to recognize and refuse covertly embedded malicious instructions. Experimental results across four mainstream LLMs demonstrate that the method significantly outperforms existing baselines, achieving strong defensive performance against three critical risks—behavioral deviation, privacy leakage, and harmful outputs—while preserving the models' original task capabilities.
📝 Abstract
Large language model (LLM)-integrated applications have become increasingly prevalent, yet they face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks poses two major challenges: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries with the surrounding context, making them difficult to identify. To address these challenges, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: behavior deviation, privacy leakage, and harmful output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation.