🤖 AI Summary
This study investigates whether multimodal models (e.g., CLIP) exhibit cognitive alignment with humans in color perception and naming—particularly regarding cultural context and abstraction level. We propose the first cognitive evaluation framework inspired by the board game *Hues & Cues*, transforming its gameplay mechanics into a quantifiable, cross-subject (AI vs. human) benchmarking protocol. Through systematic experiments calibrated against human behavioral baselines, we find that CLIP achieves high overall perceptual alignment but exhibits significant deviations on culturally loaded color terms (e.g., “Mordant tones”, “Tiffany blue”) and high-level abstract descriptions (e.g., metaphorical or affective naming). These gaps reveal culturally embedded biases and hierarchical reasoning deficits that conventional benchmarks fail to capture. Our work pioneers a game-informed paradigm for assessing human-AI cognitive similarity, offering a novel, ecologically grounded methodology for evaluating alignment beyond standard vision-language metrics.
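As a rough illustration of the protocol the summary describes, the Python sketch below scores whether CLIP picks the same board swatch a human would for a given clue word. This is a minimal sketch, not the authors' code: the swatch set, the clue prompt template, and the human baseline response are hypothetical placeholders, and the openai `clip` package is assumed.

```python
# Minimal sketch (not the authors' code): check whether CLIP picks the same
# board color a human would for a clue word, Hues & Cues style.
# Assumes the `clip` package (pip install git+https://github.com/openai/CLIP)
# and torch; the swatches and the human answer below are hypothetical.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Toy stand-in for the Hues & Cues color board: a few solid RGB swatches.
board = {"sky": (135, 206, 235), "teal": (0, 128, 128), "rose": (255, 0, 127)}
swatches = torch.stack([
    preprocess(Image.new("RGB", (224, 224), rgb)) for rgb in board.values()
]).to(device)

clue = "Tiffany blue"  # culturally loaded clue, echoing the paper's examples
text = clip.tokenize([f"the color {clue}"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(swatches)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    sims = (txt_feat @ img_feat.T).squeeze(0)  # cosine similarity per swatch

clip_pick = list(board)[sims.argmax().item()]
human_pick = "sky"  # hypothetical human behavioral baseline
print(f"CLIP picks {clip_pick!r}; agrees with human: {clip_pick == human_pick}")
```

Aggregating such agreement scores over many clues and many human responses would yield the kind of alignment statistic the summary refers to, with per-clue disagreements flagging culturally loaded or highly abstract terms.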
📝 Abstract
Playing games is inherently human, and many games are designed to challenge distinct human abilities. However, these tasks are often left out when evaluating how human-like artificial models are. The objective of this work is to propose a new approach to evaluating artificial models via board games. To this end, we test the color perception and color naming capabilities of CLIP by having it play the board game Hues & Cues and assessing its alignment with humans. Our experiments show that CLIP is generally well aligned with human observers, but our approach brings to light cultural biases and inconsistencies across abstraction levels that are hard to identify with other testing strategies. Our findings indicate that assessing models with tasks such as board games can expose deficiencies that are difficult to surface with commonly used benchmarks.