🤖 AI Summary
This work investigates whether visual models can go beyond recognizing static concepts to transfer and manipulate concept-level attributes in natural images. To this end, we introduce VisAnalog, the first multi-step visual analogy benchmark for natural images, comprising 617 human-verified analogy questions (A:B::C:?) constructed via programmatically controlled transformations such as zoom, flip, and hue rotation. We further propose a program-conditioned evaluation protocol that separates errors in relational reasoning from errors in transformation execution. Experimental results show that state-of-the-art vision-language models achieve substantially lower accuracy than humans, and that their performance degrades sharply as the number of transformation steps increases, pointing to relational reasoning as the key bottleneck.
📝 Abstract
A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B::C\!:\,?$: image $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$, respectively. Given $A$, $B$, and $C$, a model must answer a multiple-choice question about $D$. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when $D$ is directly shown, and degrades sharply as transformation depth increases, while human performance remains near ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from $A \rightarrow B$ is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at https://huggingface.co/datasets/zli99/VisAnalog.
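The construction described above, where one deterministic transformation sequence is applied to both $A$ and $C$, can be illustrated with a minimal sketch. This is not the benchmark's actual code: images are stood in for by small nested lists, and the function names (`flip_horizontal`, `rotate_90`, `quadrant_swap`, `apply_program`, `make_analogy`) as well as the exact definition of each transformation (for example, that a quadrant swap exchanges diagonal quadrants) are assumptions made for illustration. Zoom and hue rotation are omitted because they need real pixel data.

```python
# Illustrative sketch only: toy "images" are lists of rows, and the
# transformation definitions below are assumptions, not the benchmark's code.

def flip_horizontal(img):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in img]

def rotate_90(img):
    """Rotate 90 degrees counterclockwise (transpose, then reverse row order)."""
    return [list(col) for col in zip(*img)][::-1]

def quadrant_swap(img):
    """Exchange diagonal quadrants by cyclically shifting rows and columns
    by half the height and half the width (assumed definition)."""
    h, w = len(img), len(img[0])
    sh, sw = h // 2, w // 2
    rows = img[-sh:] + img[:-sh] if sh else list(img)
    return [row[-sw:] + row[:-sw] if sw else list(row) for row in rows]

# Hypothetical registry mapping step names to transformation functions.
TRANSFORMS = {
    "flip_horizontal": flip_horizontal,
    "rotate_90": rotate_90,
    "quadrant_swap": quadrant_swap,
}

def apply_program(img, program):
    """Apply a sequence of named transformation steps in order."""
    for step in program:
        img = TRANSFORMS[step](img)
    return img

def make_analogy(a, c, program):
    """Return (B, D): the same program applied to sources A and C."""
    return apply_program(a, program), apply_program(c, program)

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    c = [[5, 6], [7, 8]]
    program = ["flip_horizontal", "quadrant_swap"]  # a two-step example
    b, d = make_analogy(a, c, program)
    print("B =", b)
    print("D =", d)
```

In this framing, a model sees $A$, $B$, and $C$ and must effectively recover `program` from the $A \rightarrow B$ pair before applying it to $C$. A program-conditioned setup would hand the model `program` directly, so that any remaining errors can be attributed to applying the transformation rather than inferring it.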