Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

📅 2025-11-08

📈 Citations: 0

✨ Influential: 0

career value

194K/year

🤖 AI Summary

This work exposes a critical vulnerability of large language models (LLMs) in factual recall tasks to adversarial man-in-the-middle (MitM) prompt injection attacks: simple instruction perturbations induce error rates as high as 85.3%. To systematically assess such attacks’ impact on LLMs’ factual memory, we propose Xmera—the first framework integrating closed-book question answering with generative uncertainty modeling. We find that erroneous responses exhibit statistically significant uncertainty signatures. Building on this insight, we design a black-box detection method based on random forests, leveraging response uncertainty metrics to identify injected prompts, achieving an AUC of 96%. Our contribution is threefold: (1) identifying a novel dimension of factual integrity risk in LLMs; (2) providing the first efficient, model-agnostic detection mechanism specifically for factual recall scenarios; and (3) offering a practical, access-free safeguard to enhance LLM deployment security.

Technology Category

Application Category

📝 Abstract

LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to"victim"LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.

Problem

Research questions and friction points this paper is trying to address.

Evaluating vulnerability of LLMs to adversarial prompt injection attacks

Assessing uncertainty in LLM responses under MitM falsehood injection

Developing defense mechanisms against factual recall corruption in LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel MitM framework for LLM prompt injection

Perturbing input to undermine factual recall correctness

Random Forest classifiers detect attacks via uncertainty levels

🔎 Similar Papers

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

2024-07-01Conference on Empirical Methods in Natural Language ProcessingCitations: 2

💼 Related Jobs

No related jobs found.

Authors to Follow