VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images

📅 2026-05-21

🤖 AI Summary
This work investigates whether vision models can transfer and manipulate concept-level attributes in natural images, rather than merely recognize static concepts. It introduces VisAnalog, a multi-step visual analogy benchmark for natural images comprising 617 human-verified analogy questions (A:B::C:?) built from programmatically controlled transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. A program-conditioned evaluation protocol separates errors in relation inference from errors in transformation application. State-of-the-art vision-language models score substantially below their own oracle accuracy (measured when the target image is shown directly), while human performance stays near ceiling. Model accuracy degrades sharply as the number of transformation steps increases, which points to relation inference as the dominant bottleneck.
📝 Abstract
A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B::C\!:\,?$: image $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$, respectively. Given $A$, $B$, and $C$, a model must answer a multiple-choice question about $D$. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when $D$ is directly shown, and degrades sharply as transformation depth increases, while human performance remains near ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from $A \rightarrow B$ is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at https://huggingface.co/datasets/zli99/VisAnalog.
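The construction described in the abstract, where one deterministic transformation sequence is applied to both source images, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`flip_horizontal`, `rotate_90`, `quadrant_swap`, `apply_program`, `make_analogy`) are hypothetical, the images are random NumPy arrays standing in for natural images, the quadrant swap is assumed to exchange diagonally opposite quadrants, and zoom and hue rotation are omitted for brevity.

```python
import numpy as np


def flip_horizontal(img):
    """Mirror the image left to right."""
    return img[:, ::-1]


def rotate_90(img):
    """Rotate the image by 90 degrees counterclockwise in the image plane."""
    return np.rot90(img, k=1, axes=(0, 1))


def quadrant_swap(img):
    """Exchange diagonally opposite quadrants (exact for even height/width).

    Rolling by half the height and half the width moves each quadrant
    to the diagonally opposite position.
    """
    h, w = img.shape[:2]
    return np.roll(img, shift=(h // 2, w // 2), axis=(0, 1))


def apply_program(img, program):
    """Apply a sequence of transformations in order."""
    for step in program:
        img = step(img)
    return img


def make_analogy(a, c, program):
    """Build one A:B::C:D example by applying the same program to A and C."""
    return {
        "A": a,
        "B": apply_program(a, program),
        "C": c,
        "D": apply_program(c, program),  # hidden target in the benchmark setting
    }


# Stand-in "images": height 4, width 6, 3 color channels.
rng = np.random.default_rng(0)
image_a = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
image_c = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

# A two-step program: flip, then quadrant swap.
two_step_program = [flip_horizontal, quadrant_swap]
example = make_analogy(image_a, image_c, two_step_program)
```

In this sketch the transformation depth is simply `len(program)`, so a one- to four-step example corresponds to a list of one to four functions. A model under evaluation would see `A`, `B`, and `C` and be asked a multiple-choice question about `D`.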
Problem

Research questions and friction points this paper is trying to address.

visual concept transfer
natural images
transformation
visual reasoning
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual concept transfer
analogy reasoning
controlled benchmark
transformation inference
vision-language models