🤖 AI Summary
This study investigates how scene context shapes the way humans conceptualize and refer to identical abstract tangram shapes, and evaluates whether multimodal large language models (MLLMs) show comparable contextual flexibility. Method: The authors introduce SceneGram, a human-annotated dataset of crowd-sourced references to the same tangram shapes placed in diverse scene contexts, enabling systematic cross-scene analyses of naming preferences and conceptual expectations. Contribution/Results: Analyses of this data show that human conceptualization varies substantially with scene context, whereas references generated by state-of-the-art MLLMs fail to capture the richness and variability found in human responses. The dataset provides a resource for studying contextual effects on conceptualization in both humans and vision-language models.
📝 Abstract
Research on reference and naming suggests that humans can come up with very different ways of conceptualizing and referring to the same object, e.g. the same abstract tangram shape can be a "crab", "sink", or "space ship". Another common assumption in cognitive science is that scene context fundamentally shapes our visual perception of objects and conceptual expectations. This paper contributes SceneGram, a dataset of human references to tangram shapes placed in different scene contexts, allowing for systematic analyses of the effect of scene context on conceptualization. Based on this data, we analyze references to tangram shapes generated by multimodal LLMs, showing that these models do not account for the richness and variability of conceptualizations found in human references.