BQA: Body Language Question Answering Dataset for Video Large Language Models

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Current VideoLLMs perform poorly at body language understanding, particularly at recognizing affective states conveyed by ambiguous, unconscious nonverbal behaviors such as posture and gestures. To probe this, the authors introduce BQA, a body language question-answering benchmark for VideoLLMs built from short videos of human motion and annotated with 26 fine-grained emotion labels. Evaluations of a range of VideoLLMs on BQA show consistently weak performance, and analysis of the wrong answers reveals that certain models answer with significant bias depending on the age group and ethnicity of the individuals in the video. The BQA dataset is publicly available.

📝 Abstract
A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules and requires complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as unconscious human actions can easily cause a model to misinterpret intent. To address this, we propose BQA, a body language question answering dataset, to test whether models can correctly interpret emotions from short clips of body language annotated with 26 emotion labels. We evaluated various VideoLLMs on BQA and revealed that understanding body language is challenging; our analyses of the wrong answers show that certain VideoLLMs gave significantly biased answers depending on the age group and ethnicity of the individuals in the video. The dataset is available.
Problem

Research questions and friction points this paper is trying to address.

Enabling VideoLLMs to accurately interpret nonverbal body language cues
Addressing model biases in interpreting emotions across age and ethnicity
Validating model performance on 26 emotion labels from body language clips
Innovation

Methods, ideas, or system contributions that make the work stand out.

BQA dataset for body language emotion recognition
Evaluates VideoLLMs on 26 emotion labels
Analyzes biases in VideoLLMs by demographics
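The evaluation described above, multiple-choice emotion QA scored overall and broken down by demographic metadata to surface bias, can be sketched as follows. This is a minimal illustration, not the paper's released code: the `BQAExample` structure, the field names, and the toy predictor are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BQAExample:
    """One BQA-style item (hypothetical schema, for illustration only)."""
    video_id: str
    choices: list   # candidate emotion labels (subset of the 26)
    answer: str     # gold emotion label
    age_group: str  # demographic metadata used for bias analysis
    ethnicity: str


def evaluate(examples, predict):
    """Return overall accuracy and per-age-group accuracy.

    A gap between group accuracies is a simple signal of the kind of
    demographic bias the paper reports.
    """
    correct = 0
    by_age = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for ex in examples:
        ok = predict(ex) == ex.answer
        correct += ok
        by_age[ex.age_group][0] += ok
        by_age[ex.age_group][1] += 1
    acc = correct / len(examples)
    age_acc = {g: c / n for g, (c, n) in by_age.items()}
    return acc, age_acc


# Toy data and a dummy predictor that always picks the first choice,
# standing in for a real VideoLLM's answer.
data = [
    BQAExample("v1", ["joy", "anger"], "joy", "child", "A"),
    BQAExample("v2", ["fear", "calm"], "calm", "adult", "B"),
    BQAExample("v3", ["joy", "fear"], "joy", "adult", "A"),
]
acc, age_acc = evaluate(data, lambda ex: ex.choices[0])
print(acc, age_acc)  # 2/3 overall; accuracy split by age group
```

Swapping the dummy predictor for a real VideoLLM call (video frames plus the question and answer choices in the prompt) turns this into the benchmark loop; the per-group breakdown is what exposes the age and ethnicity biases the paper analyzes.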