Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English

📅 2025-07-01
🤖 AI Summary
This study investigates whether linguistic and cultural background shapes the attentional patterns of vision-language models (VLMs): specifically, whether Japanese-trained VLMs exhibit holistic processing (emphasizing scene-level context) while English-trained VLMs adopt analytic processing (focusing on individual objects). Drawing on paradigms from cross-cultural cognitive research, we systematically compare image captions generated by Japanese–English bilingual VLMs, complemented by attention visualization and quantitative analysis. Our findings provide the first empirical evidence that VLMs not only acquire linguistic structure but also implicitly internalize culture-specific cognitive preferences embedded in their training data. Japanese VLMs show significantly stronger modeling of global relations, whereas English VLMs prioritize object-centric representations, a contrast that closely matches well-established patterns in human cross-cultural cognition. This work highlights the role of language as a cultural carrier that shapes how multimodal models process visual scenes, and it offers both theoretical grounding and an evaluation framework for culturally aware AI modeling.

📝 Abstract
Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.

Problem

Research questions and friction points this paper is trying to address.

Examines whether Vision-Language Models exhibit culturally grounded cognitive styles
Compares holistic attention in Japanese-trained VLMs with analytic attention in English-trained VLMs
Investigates whether VLMs reproduce cultural behaviors embedded in their training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes VLMs trained predominantly on Japanese and on English
Compares holistic and analytic attention patterns across the two
Shows that cultural cognition may implicitly shape model outputs

Ahmed Sabir
University of Tartu, Estonia
Azinovič Gasper
University of Ljubljana, Slovenia
Mengsay Loem
Sansan, Inc., Japan
Rajesh Sharma
University of Tartu
Computational Social Science, Data Science, Social Network Analysis, Social Computing, #unitartucs