🤖 AI Summary
This study identifies a novel security vulnerability in instruction-following retrievers: their instruction-following capability can be exploited by adversarial queries to reliably surface harmful content. Method: the authors construct a curated adversarial query set and a harmful document corpus, then empirically evaluate six state-of-the-art retrieval models—including NV-Embed and LLM2Vec—both standalone and in end-to-end RAG settings. Contribution/Results: most models retrieve relevant harmful passages for over 50% of adversarial queries (LLM2Vec reaches 61.35%). Critically, even safety-aligned large language models (e.g., Llama3) generate policy-violating outputs when fed the contaminated retrieval results. This work provides the first systematic evidence that the instruction-following mechanism itself constitutes a new attack surface, and shows that current safety-alignment techniques fail to defend against harmful content supplied in-context by retrieval.
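The evaluation described above boils down to a recall measurement: for each malicious query, does the retriever rank the matching harmful document among its top results? A minimal sketch of that metric is below. All names (`cosine`, `recall_at_k`, the toy vectors) are hypothetical stand-ins; the paper uses real retriever embeddings (e.g., NV-Embed, LLM2Vec) over its curated query set and harmful corpus.

```python
# Hedged sketch: harmful-passage recall@k for a dense retriever.
# Toy 2-D vectors stand in for real retriever embeddings.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def recall_at_k(query_vecs, doc_vecs, relevant, k=1):
    """Fraction of queries whose top-k retrieved docs include the
    query's relevant (here: harmful) document index."""
    hits = 0
    for qi, q in enumerate(query_vecs):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda di: cosine(q, doc_vecs[di]),
                        reverse=True)
        if relevant[qi] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)

# Two toy queries, each targeting one "harmful" document.
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 1]  # index of the targeted doc per query
print(recall_at_k(queries, docs, relevant, k=1))  # → 1.0
```

The paper's headline numbers (e.g., 61.35% for LLM2Vec) correspond to this kind of recall computed over the full adversarial query set.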
📝 Abstract
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval-augmented generation (RAG) setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.