🤖 AI Summary
RAG-enhanced large language models (LLMs) face serious privacy and security risks, including data leakage and poisoning attacks, in sensitive domains such as healthcare and finance. Existing work lacks a controllable, end-to-end mechanism for intercepting traffic in the query-response pipeline. To address this, we propose ControlNET, the first AI firewall specifically designed for RAG systems, built on a two-stage control paradigm: "activation shift detection" followed by "semantic divergence mitigation." The approach combines neural activation analysis, semantic similarity modeling, and a lightweight intervention module, and it is compatible with mainstream open-source LLMs (e.g., Llama3, Vicuna, Mistral). Evaluated on four benchmarks (MS MARCO, HotpotQA, FinQA, and MedicalSys), it achieves an AUROC above 0.909 in detecting and mitigating malicious queries and harmful responses while preserving system harmlessness. This work addresses a gap in end-to-end query-flow governance for RAG systems.
📝 Abstract
Retrieval-Augmented Generation (RAG) has significantly enhanced the factual accuracy and domain adaptability of Large Language Models (LLMs), enabling their widespread deployment in sensitive domains such as healthcare, finance, and enterprise applications. RAG mitigates hallucinations by integrating external knowledge, yet it also introduces privacy and security risks, notably data breaches and data poisoning. While recent studies have explored prompt injection and poisoning attacks, there remains a significant gap in comprehensive research on controlling inbound and outbound query flows to mitigate these threats. In this paper, we propose an AI firewall, ControlNET, designed to safeguard RAG-based LLM systems from these vulnerabilities. ControlNET controls query flows by leveraging the activation shift phenomenon to detect adversarial queries and by using semantic divergence to mitigate their impact. We conduct comprehensive experiments on four benchmark datasets (MS MARCO, HotpotQA, FinQA, and MedicalSys) using state-of-the-art open-source LLMs (Llama3, Vicuna, and Mistral). Our results demonstrate that ControlNET achieves an AUROC above 0.909 in detecting and mitigating security threats while preserving system harmlessness. Overall, ControlNET offers an effective, robust, and harmless defense mechanism, marking a significant advancement toward the secure deployment of RAG-based LLM systems.
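To make the reported metric concrete, the following is a minimal sketch of how a detector based on activation shift could be scored with AUROC. It is not ControlNET's implementation: the abstract does not specify how the shift is measured, so the sketch assumes a simple distance from a benign centroid, uses synthetic Gaussian vectors in place of real LLM activations, and all function names (`centroid`, `shift_score`, `auroc`, `synthetic_activations`) are hypothetical.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def shift_score(activation, benign_centroid):
    """Assumed shift measure: Euclidean distance from the benign centroid."""
    return math.dist(activation, benign_centroid)


def auroc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual AUROC definition.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def synthetic_activations(rng, count, dim, mean):
    """Stand-in for real hidden-state vectors (an assumption of this sketch)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 16

# Benign reference set used to estimate the "normal" activation centroid.
benign_reference = synthetic_activations(rng, 200, DIM, 0.0)
# Held-out benign queries and adversarial queries with a shifted mean.
benign_test = synthetic_activations(rng, 100, DIM, 0.0)
adversarial_test = synthetic_activations(rng, 100, DIM, 1.0)

center = centroid(benign_reference)
benign_scores = [shift_score(a, center) for a in benign_test]
adversarial_scores = [shift_score(a, center) for a in adversarial_test]

# Adversarial queries are the positive class for detection.
detection_auroc = auroc(adversarial_scores, benign_scores)
print(f"AUROC on synthetic data: {detection_auroc:.3f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value above 0.909 means an adversarial query receives a higher score than a benign one in more than 90.9% of random pairings.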