Visual Persuasion: What Influences Decisions of Vision-Language Models?

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a systematic understanding of vision-language models' (VLMs) preferences in image selection tasks, a gap that hinders the identification of underlying visual decision biases and safety risks. The authors propose the first visual utility inference framework grounded in revealed preference theory, constructing controlled image selection tasks in which input images undergo systematic visual perturbations, such as changes in composition, lighting, and background, applied via generative models. By iteratively optimizing visual prompts to maximize the likelihood of VLM selection, the method extends text prompt optimization into the visual domain and establishes an automated interpretability pipeline. Large-scale experiments demonstrate that the optimized visual edits significantly increase the probability of VLM selection in pairwise comparisons, revealing stable visual preference patterns and offering an efficient, proactive approach to auditing the visual safety of AI agents.
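At its core, the optimization described in the summary is a search over edit instructions scored by the VLM's own choices. The sketch below illustrates one plausible form of that loop; the helper names (`propose_edit`, `apply_edit`, `selection_prob`) are hypothetical stand-ins not specified by the paper, standing for an LLM edit proposer, a generative image editor, and a pairwise VLM query.

```python
# Minimal sketch of the visual prompt optimization loop. The three helpers are
# hypothetical: the paper does not name its interfaces for proposing edits,
# applying them with a generative model, or querying the VLM's pairwise choice.
from PIL import Image


def propose_edit(img: Image.Image) -> str:
    """Ask an LLM for a visually plausible edit, e.g. 'warmer lighting' (stub)."""
    raise NotImplementedError


def apply_edit(img: Image.Image, instruction: str) -> Image.Image:
    """Apply the instruction with an image generation model (stub)."""
    raise NotImplementedError


def selection_prob(candidate: Image.Image, reference: Image.Image) -> float:
    """Estimate P(VLM picks candidate over reference) in a pairwise prompt (stub)."""
    raise NotImplementedError


def optimize_image(image: Image.Image, rounds: int = 10, proposals: int = 4):
    """Hill-climb over edits to maximize the VLM's pairwise selection probability."""
    best_img, best_score = image, 0.5  # an image ties against itself
    for _ in range(rounds):
        improved = False
        for _ in range(proposals):
            candidate = apply_edit(best_img, propose_edit(best_img))
            score = selection_prob(candidate, image)  # head-to-head vs. original
            if score > best_score:
                best_img, best_score, improved = candidate, score, True
        if not improved:
            break  # no proposed edit raised selection probability
    return best_img, best_score
```

Keeping the unedited image as the fixed reference mirrors the pairwise comparisons the summary describes: each accepted edit must beat the original head to head.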

📝 Abstract
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text prompt optimization techniques to iteratively propose and apply visually plausible modifications (such as changes in composition, lighting, or background) using an image generation model. We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities and safety concerns that might otherwise be discovered only in the wild, supporting more proactive auditing and governance of image-based AI agents.
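The abstract frames decisions as revealing a latent visual utility. A standard way to recover such utilities from pairwise choices is a Bradley-Terry fit; the sketch below is an illustration of that idea under stated assumptions, not the paper's confirmed estimator.

```python
import numpy as np


def fit_utilities(wins: np.ndarray, lr: float = 0.1, steps: int = 2000) -> np.ndarray:
    """Fit latent utilities u under a Bradley-Terry model,
    P(i chosen over j) = sigmoid(u_i - u_j), by gradient ascent.

    wins[i, j] = number of times the VLM chose image i over image j.
    Only utility differences are identified, so u is centered at zero.
    """
    n = wins.shape[0]
    u = np.zeros(n)
    total = wins + wins.T  # comparisons observed per pair
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))  # model P(i beats j)
        grad = (wins - total * p).sum(axis=1)  # gradient of the log-likelihood
        u += lr * grad / max(total.sum(), 1)
        u -= u.mean()  # fix the additive gauge freedom
    return u
```

Differences u[i] - u[j] then predict head-to-head selection probabilities for unseen pairs, which is the sense in which consistent visual themes can be read off as edit directions that raise utility.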
Problem

Research questions and friction points this paper is trying to address.

visual preference
vision-language models
image-based decision
visual bias
AI safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual prompting
vision-language models
revealed preference
image editing
AI interpretability