🤖 AI Summary
This study investigates where human observers and machine vision models diverge in perceiving semantic boundaries within ambiguous images. By interpolating in CLIP's embedding space to generate a continuous spectrum of semantically ambiguous stimuli and running psychophysical experiments on these stimuli, the authors systematically compare where humans and models place conceptual category boundaries. For the first time, psychophysics-inspired semantic ambiguity is employed as an interpretability probe, revealing that human judgments align more closely with the CLIP embeddings used for synthesis, whereas machine classifiers exhibit a bias toward the "rabbit" category. Additionally, the guidance scale is found to influence perceptual judgments markedly more strongly in humans than in models, offering a novel perspective on the alignment, or misalignment, between human and artificial visual perception.
📝 Abstract
The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to measure precisely where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit" than humans are, that humans align more closely with the CLIP embedding used for synthesis, and that the guidance scale appears to affect human sensitivity more strongly than it affects machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool bridging human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
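The abstract does not spell out the interpolation step, but its core idea can be illustrated with a short sketch. The snippet below is an illustrative assumption, not the authors' code: the checkpoint (`openai/clip-vit-base-patch32`), the prompts, the `slerp` helper, and the number of interpolation steps are all placeholders showing how a continuous duck-to-rabbit spectrum of CLIP embeddings might be produced before being fed to an image generator.

```python
import torch
from transformers import CLIPModel, CLIPProcessor


def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between two embedding vectors."""
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1.0, 1.0))  # angle between the vectors
    return (torch.sin((1.0 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)


# Hypothetical choices: checkpoint and prompts are not specified in the abstract.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["a photo of a duck", "a photo of a rabbit"],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)  # shape: (2, embedding_dim)

# An 11-step spectrum from pure "duck" (t=0) to pure "rabbit" (t=1).
spectrum = [slerp(emb[0], emb[1], t / 10) for t in range(11)]

# Each interpolated embedding would then condition a CLIP-guided generative
# model (generator unspecified in the abstract); its guidance scale controls
# how strongly each synthesized image adheres to the conditioning embedding.
```

Spherical rather than linear interpolation is a common choice here, since it keeps intermediate points near the hypersphere on which CLIP embeddings are normally compared; whether the paper uses slerp or another scheme is not stated in this abstract.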