🤖 AI Summary
This study investigates the cultural awareness of Video Large Language Models (VideoLLMs) in cross-cultural contexts, specifically their ability to recognize socio-cultural norms in US and Chinese settings. Method: We introduce VideoNorms, the first pragmatics-grounded benchmark based on speech act theory, comprising over 1,000 annotated (video clip, norm) pairs. Annotations cover speech acts, norm adherence/violation labels, and verbal and non-verbal cues, produced via a human-AI collaborative framework that combines theory-informed prompting, expert verification, and multimodal analysis. Contribution/Results: Experiments reveal that current VideoLLMs show significant deficiencies in detecting norm violations, interpreting Chinese cultural scenarios, leveraging non-verbal evidence, and linking speech acts to their corresponding norms. This work establishes the first systematic, culture-aware evaluation benchmark for VideoLLMs, uncovering critical bottlenecks in cultural understanding and providing infrastructure and empirical foundations for culturally adaptive model training.
📝 Abstract
As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1,000 (video clip, norm) pairs from US and Chinese cultures, annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework in which a teacher model, using theoretically grounded prompting, provides candidate annotations, and a set of trained human experts validate and correct them. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violations than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally grounded video language model training, a gap our benchmark and framework begin to address.