AI Summary
This study addresses the lack of systematic evaluation of privacy risks in large vision-language models (LVLMs) under semantic visual attacks, such as OCR injection and contextual personally identifiable information (PII) leakage. The authors propose VisualLeakBench, a benchmark comprising 1,000 synthetically generated adversarial images and 50 real-world screenshots, to conduct the first comprehensive audit of mainstream LVLMs in privacy-sensitive scenarios. Their analysis reveals a prevalent "comply-then-warn" behavior, in which models disclose sensitive data verbatim before issuing any safety warning, and shows that the efficacy of defensive prompting depends heavily on template design. Experimental results show that Claude 4 exhibits the lowest attack success rate (ASR) under OCR injection (14.2%) but the highest PII leakage rate (74.4%), whereas Grok-4 achieves the lowest PII ASR (20.4%). Targeted defensive prompts can substantially suppress, and in some models eliminate, leakage.
Abstract
As Large Vision-Language Models (LVLMs) are increasingly deployed in agent-integrated workflows and other deployment-relevant settings, their robustness against semantic visual attacks remains under-evaluated -- alignment is typically tested on explicit harmful content rather than privacy-critical multimodal scenarios. We introduce VisualLeakBench, an evaluation suite that audits LVLMs against OCR Injection and Contextual PII Leakage using 1,000 synthetically generated adversarial images covering 8 PII types, validated on 50 in-the-wild (IRL) screenshots spanning diverse visual contexts. We evaluate four frontier systems (GPT-5.2, Claude 4, Gemini-3 Flash, Grok-4), reporting Wilson 95% confidence intervals. Claude 4 achieves the lowest OCR ASR (14.2%) but the highest PII ASR (74.4%), exhibiting a comply-then-warn pattern in which verbatim data disclosure precedes any safety-oriented language. Grok-4 achieves the lowest PII ASR (20.4%). A defensive system prompt eliminates PII leakage for two models and reduces Claude 4's leakage from 74.4% to 2.2%, but has no effect on Gemini-3 Flash on synthetic data. Strikingly, IRL validation reveals that Gemini-3 Flash does respond to mitigation on real-world images (50% to 0%), indicating that mitigation robustness is template-sensitive rather than uniformly absent. We release our dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
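As a point of reference for the reported error bars, the Wilson score interval for a binomial proportion (an attack success rate over n trials) can be computed as in the sketch below. The function name and the example counts are illustrative, not taken from the paper's released code.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    successes: number of successful attacks observed
    n: total number of trials
    z: normal quantile (1.96 for a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Illustrative example: 142 leaks observed over 1,000 adversarial images
# (a 14.2% ASR, matching the magnitude of the reported OCR ASR for Claude 4).
lo, hi = wilson_ci(142, 1000)
```

Unlike the naive normal-approximation interval, the Wilson interval stays within [0, 1] and remains meaningful for proportions near 0% or 100%, which matters here since several mitigation results report 0% leakage.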