🤖 AI Summary
This work exposes a critical vulnerability of large language models (LLMs) to adversarial man-in-the-middle (MitM) prompt injection in factual recall tasks: simple instruction-based perturbations achieve attack success rates of up to roughly 85.3%. To systematically assess how such attacks affect LLMs' factual memory, the authors propose Xmera, which they describe as the first framework to combine closed-book question answering with an analysis of generation uncertainty. They find that incorrectly answered questions come with high response uncertainty. Building on this insight, they design a black-box detection method that trains Random Forest classifiers on response uncertainty metrics to separate attacked from unattacked queries, reaching an average AUC of up to roughly 96%. The contribution is threefold: (1) identifying a novel dimension of factual integrity risk in LLMs; (2) providing the first efficient, model-agnostic detection mechanism aimed specifically at factual recall scenarios; and (3) offering a practical safeguard that requires no access to model internals, improving the security of LLM deployments.
📝 Abstract
LLMs are now an integral part of information retrieval. Their role as question-answering chatbots therefore raises significant concerns, given their demonstrated vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book, fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks achieve the highest success rate (up to ~85.3%) while simultaneously exhibiting high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupted LLMs is a first checkpoint toward user safety in cyberspace.
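To make the detection idea concrete, here is a minimal sketch of how response uncertainty could be turned into a detection score and evaluated with AUC. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify which uncertainty metrics are used, so mean and maximum token entropy stand in for them; the next-token distributions are synthetic toy data; and all function names (`token_entropy`, `response_uncertainty`, `auc`, `toy_distribution`, `toy_response`) are hypothetical. To keep the example runnable with only the standard library, the Random Forest is replaced by a single-feature score (mean entropy), which is a much simpler detector than the one described above.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def response_uncertainty(step_distributions):
    """Summarise a response by the mean and max entropy over its generation steps."""
    entropies = [token_entropy(d) for d in step_distributions]
    return {
        "mean_entropy": sum(entropies) / len(entropies),
        "max_entropy": max(entropies),
    }


def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def toy_distribution(peak, vocab_size=4):
    """Hypothetical distribution: one token gets `peak`, the rest share the remainder."""
    rest = (1.0 - peak) / (vocab_size - 1)
    return [peak] + [rest] * (vocab_size - 1)


def toy_response(rng, low, high, steps=5):
    """Hypothetical response: `steps` distributions with peaks drawn from [low, high]."""
    return [toy_distribution(rng.uniform(low, high)) for _ in range(steps)]


# Assumption: attacked queries tend to produce flatter (less confident) distributions.
rng = random.Random(0)
clean = [
    response_uncertainty(toy_response(rng, 0.80, 0.99))["mean_entropy"]
    for _ in range(50)
]
attacked = [
    response_uncertainty(toy_response(rng, 0.40, 0.90))["mean_entropy"]
    for _ in range(50)
]

# Treat "attacked" as the positive class and mean entropy as the detection score.
detector_auc = auc(attacked, clean)
print(f"AUC of the mean-entropy score on toy data: {detector_auc:.3f}")
```

In a setting closer to the one described, the per-response feature dictionary would be fed to a classifier such as a Random Forest rather than thresholded on a single value, and the distributions would come from the victim LLM's actual outputs instead of synthetic data.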