🤖 AI Summary
Existing prompt injection detection methods are typically evaluated under limited settings that ignore real-world operating constraints, which limits their practical validity. This work proposes a deployment-aware evaluation framework that systematically compares lexical, semantic, structural, and Transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. It introduces interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assesses them both within sparse models and in combination with strong encoder baselines. Experiments show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings: Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low-false-positive-rate behaviour in harder scenarios. The findings further reveal a gap between ranking performance and actual deployment effectiveness.
📝 Abstract
Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.
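To make the distinction between ranking metrics and thresholded deployment metrics concrete, the following is a minimal sketch in plain Python. It is not the paper's evaluation code: the function names `auroc` and `tpr_at_fpr`, the 1% false-positive budget, and the toy scores are all assumptions chosen for illustration. It shows how a detector with the higher AUROC can still be the worse choice once a strict false-positive budget is imposed.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Ranking metric: probability that a random injection (label 1)
    scores higher than a random benign prompt (label 0); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               max_fpr: float) -> float:
    """Thresholded metric: best true positive rate over all thresholds
    whose false positive rate stays within max_fpr.
    A prompt is flagged when its score is >= the threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0  # a threshold above every score flags nothing: TPR 0, FPR 0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Toy data (hypothetical): four injections followed by four benign prompts.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Detector A ranks well overall but gives one benign prompt the top score.
scores_a = [0.9, 0.8, 0.7, 0.6, 0.95, 0.3, 0.2, 0.1]
# Detector B ranks worse overall but keeps its top scores free of benign prompts.
scores_b = [0.9, 0.8, 0.4, 0.3, 0.7, 0.6, 0.5, 0.2]

print(auroc(scores_a, labels), tpr_at_fpr(scores_a, labels, max_fpr=0.01))
print(auroc(scores_b, labels), tpr_at_fpr(scores_b, labels, max_fpr=0.01))
```

On this toy data, detector A has the higher AUROC (0.75 versus 0.625) yet catches no injections within the 1% false-positive budget, whereas detector B catches half of them. With only four benign prompts, any single false positive already exceeds the budget, so the example exaggerates the effect; it is meant only to illustrate why ranking and thresholded metrics can disagree.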