Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks

📅 2025-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical vulnerability of vision-language model (VLM)-based dense document retrievers, such as DSE and ColPali, to pixel-level adversarial attacks in screenshot-based retrieval. We quantify a previously unreported pixel poisoning vulnerability at the visual input interface and propose three single-image, injection-based pixel poisoning methods. Experiments demonstrate that injecting just one malicious screenshot suffices to poison the top-10 retrieval results for 41.9% of queries against DSE and 26.4% against ColPali; targeted attacks on a small set of known queries reach 100% success in certain cases, showing these models to be markedly less robust than text-only retrievers. To our knowledge, this is the first study to extend adversarial robustness evaluation into the pixel space of cross-modal dense retrieval. Our work establishes a foundational benchmark and delivers a critical security warning for VLM-driven document search systems.

📝 Abstract
Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods. In this study, we propose three pixel poisoning attack methods designed to compromise VLM-based retrievers and evaluate their effectiveness under various attack settings and parameter configurations. Our empirical results demonstrate that injecting even a single adversarial screenshot into the retrieval corpus can significantly disrupt search results, poisoning the top-10 retrieved documents for 41.9% of queries in the case of DSE and 26.4% for ColPali. These vulnerability rates notably exceed those observed with equivalent attacks on text-only retrievers. Moreover, when targeting a small set of known queries, the attack success rate rises further, achieving complete success in certain cases. By exposing the vulnerabilities inherent in vision-language models, this work highlights the potential risks associated with their deployment.
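The attack the abstract describes can be sketched in toy form: optimize the pixels of an injected screenshot so its embedding maximizes cosine similarity with a target query embedding, causing the poisoned image to surface in top-k results. The sketch below is a hedged illustration only: a fixed random linear projection stands in for the real VLM encoder (DSE/ColPali), and the `poison` routine, dimensions, and hyperparameters are assumptions, not the paper's actual method.

```python
import numpy as np

# Toy sketch of a single-image pixel poisoning attack on a dense retriever.
# Assumption: a random linear projection stands in for the VLM image encoder;
# all names and hyperparameters here are illustrative, not the paper's.
rng = np.random.default_rng(0)
D_PIX, D_EMB = 64, 16                                  # toy pixel / embedding dims
W = rng.normal(size=(D_PIX, D_EMB)) / np.sqrt(D_PIX)   # stand-in "encoder"

def embed(x):
    """Embed flattened pixels and L2-normalize, as dense retrievers do."""
    v = x @ W
    return v / np.linalg.norm(v)

def poison(q, steps=400, lr=0.05):
    """Normalized gradient ascent on pixel values to maximize cos(embed(x), q)."""
    x = rng.uniform(0.0, 1.0, D_PIX)        # start from a random "screenshot"
    for _ in range(steps):
        v = x @ W
        n = np.linalg.norm(v)
        # gradient of (v . q) / |v| with respect to x, via the chain rule
        g = W @ (q / n - v * (v @ q) / n**3)
        # step along the gradient, keeping pixels in the valid [0, 1] range
        x = np.clip(x + lr * g / (np.linalg.norm(g) + 1e-12), 0.0, 1.0)
    return x

q = rng.normal(size=D_EMB)                  # a target query embedding (toy)
q /= np.linalg.norm(q)
benign = rng.uniform(0.0, 1.0, D_PIX)       # an ordinary document screenshot
adv = poison(q)
print(f"benign sim: {embed(benign) @ q:.3f}  adversarial sim: {embed(adv) @ q:.3f}")
```

In a real attack the gradient would come from backpropagation through the VLM image encoder, and the poisoned screenshot would additionally need to remain visually plausible; both constraints are omitted in this toy version.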
Problem

Research questions and friction points this paper is trying to address.

Image Search
Pixel Poisoning Attack
Search Accuracy Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel Poisoning Attack
Vision-Language Model Security
Targeted Attack Effectiveness