🤖 AI Summary
This study identifies a novel security vulnerability in instruction-following retrievers: their instruction-following capability can be exploited by adversarial queries to reliably surface harmful content. Method: the authors construct a curated adversarial query set and a harmful document corpus, then empirically evaluate six state-of-the-art retrieval models—including NV-Embed and LLM2Vec—both standalone and in end-to-end RAG settings. Contribution/Results: most models retrieve relevant harmful passages for over 50% of adversarial queries (LLM2Vec reaches 61.35%). Critically, even safety-aligned large language models (e.g., Llama3) generate policy-violating outputs when fed the contaminated retrieval results. This work provides the first systematic evidence that the instruction-following mechanism itself constitutes a new attack surface, and shows that current safety-alignment techniques fail to defend against harmful content supplied in-context by retrieval.
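The evaluation described above boils down to a recall measurement: for each malicious query, does the retriever rank the matching harmful document among its top results? A minimal sketch of that metric is below. All names (`cosine`, `recall_at_k`, the toy vectors) are hypothetical stand-ins; the paper uses real retriever embeddings (e.g., NV-Embed, LLM2Vec) over its curated query set and harmful corpus.

```python
# Hedged sketch: harmful-passage recall@k for a dense retriever.
# Toy 2-D vectors stand in for real retriever embeddings.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def recall_at_k(query_vecs, doc_vecs, relevant, k=1):
    """Fraction of queries whose top-k retrieved docs include the
    query's relevant (here: harmful) document index."""
    hits = 0
    for qi, q in enumerate(query_vecs):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda di: cosine(q, doc_vecs[di]),
                        reverse=True)
        if relevant[qi] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)

# Two toy queries, each targeting one "harmful" document.
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 1]  # index of the targeted doc per query
print(recall_at_k(queries, docs, relevant, k=1))  # → 1.0
```

The paper's headline numbers (e.g., 61.35% for LLM2Vec) correspond to this kind of recall computed over the full adversarial query set.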
📝 Abstract
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval-augmented generation (RAG) setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.