Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether vision-language models (VLMs) exhibit human-like cross-modal semantic integration, using the bouba-kiki effect (the robust human tendency to associate rounded shapes with the pseudoword "bouba" and spiky shapes with "kiki") as a cognitive benchmark. Focusing on two CLIP variants, one with a ResNet image encoder and one with a Vision Transformer (ViT), the authors pair human-inspired prompt designs with Grad-CAM-based visual attention analysis to systematically evaluate shape-phoneme alignment. Results reveal that current VLMs fail to consistently replicate human bouba-kiki preferences, indicating a lack of perception-driven semantic coupling in their cross-modal alignment. The paper proposes the first cognitive interpretability-oriented analytical framework for assessing cross-modal alignment in VLMs, grounded in behavioural and neurocognitive principles, and offers empirical evidence and methodological foundations for brain-inspired improvements to VLM architectures and training paradigms.
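To make the prompt-based protocol concrete, here is a minimal sketch of reading CLIP's image-text probabilities as a shape-word preference. This is an illustration, not the authors' code: the Hugging Face checkpoint, the prompt wording, and the stimulus file round_shape.png are all assumptions.

```python
# Minimal sketch of a prompt-based bouba-kiki probe with CLIP.
# Assumptions (not from the paper): the Hugging Face checkpoint
# "openai/clip-vit-base-patch32" and a local rounded-shape stimulus
# "round_shape.png"; the paper's prompts and weights may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("round_shape.png")  # hypothetical stimulus image
prompts = ['a shape called "bouba"', 'a shape called "kiki"']

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; the softmax over
# the two prompts is read as the model's relative preference.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for prompt, p in zip(prompts, probs):
    print(f"{prompt}: {p.item():.3f}")
```

A congruent model would put clearly more probability on "bouba" for rounded shapes and on "kiki" for spiky ones.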

📝 Abstract
Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like "bouba" with round shapes and "kiki" with jagged ones. Given the mixed evidence for this effect in VLMs found in prior studies, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), chosen for their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses output probabilities as a measure of model preference, and a Grad-CAM analysis as a novel way to interpret visual attention in shape-word matching tasks. Our findings show that these models do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both models lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
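The Grad-CAM component can be sketched in a similar spirit. The code below is a minimal illustration, assuming OpenAI's open-source clip package with RN50 weights and a hypothetical stimulus spiky_shape.png; the paper's exact pipeline may differ. It backpropagates the image-text similarity into the last convolutional stage and builds the standard Grad-CAM heatmap.

```python
# Minimal Grad-CAM sketch over CLIP's ResNet visual encoder.
# Assumptions (not from the paper): OpenAI's `clip` package, RN50
# weights, and a hypothetical stimulus "spiky_shape.png".
import clip
import torch
from PIL import Image

model, preprocess = clip.load("RN50", device="cpu")
model.eval()

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["feat"] = output

def save_gradient(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0]

# Hook the last convolutional stage; its output is (1, 2048, 7, 7)
# for a 224x224 input.
model.visual.layer4.register_forward_hook(save_activation)
model.visual.layer4.register_full_backward_hook(save_gradient)

image = preprocess(Image.open("spiky_shape.png")).unsqueeze(0)
image.requires_grad_(True)  # make sure gradients reach the hooks
text = clip.tokenize(['a shape called "kiki"'])

# Backpropagate the image-text similarity score.
score = torch.cosine_similarity(model.encode_image(image),
                                model.encode_text(text))
model.zero_grad()
score.backward()

# Grad-CAM: pool gradients into channel weights, take a weighted sum of
# the activations, then ReLU and normalise for visualisation.
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((weights * activations["feat"]).sum(dim=1)).squeeze()
cam = cam / (cam.max() + 1e-8)
```

Overlaying cam (upsampled to the input resolution) on the stimulus shows which image regions drive the similarity for each pseudoword.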
Problem

Research questions and friction points this paper is trying to address.

Evaluate whether VLMs exhibit the bouba-kiki effect in line with human cognition
Assess cross-modal shape-word associations in CLIP ResNet and ViT
Compare model performance to human data on modality integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates ResNet and ViT variants of CLIP
Uses prompt-based probability readouts and Grad-CAM attention analysis
Compares model responses with human data (see the sketch after this list)
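As a toy illustration of such a model-human comparison, the sketch below tests a model's congruent-choice rate against chance and sets it beside a human reference rate. Every number here is a placeholder, not a result from the paper.

```python
# Sketch of a congruency analysis: is the model's rate of "expected"
# shape-word choices above chance, and how does it compare to humans?
# All numbers are placeholders, NOT values from the paper.
from scipy.stats import binomtest

model_congruent = 61   # trials where the model chose the expected word
n_trials = 100
human_rate = 0.90      # placeholder human congruency rate

result = binomtest(model_congruent, n_trials, p=0.5, alternative="greater")
print(f"model congruency: {model_congruent / n_trials:.2f} "
      f"(human reference: {human_rate:.2f}), "
      f"p vs. chance = {result.pvalue:.4f}")
```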