Contextual Emotion Recognition using Large Vision Language Models

πŸ“… 2024-05-14
πŸ›οΈ IEEE/RJS International Conference on Intelligent RObots and Systems
πŸ“ˆ Citations: 2
✨ Influential: 0
πŸ€– AI Summary
Existing facial-expression-only approaches to human emotion recognition in real-world scenarios suffer from poor generalization due to insufficient contextual grounding. Method: This paper proposes a contextualized emotion understanding framework that jointly models body pose, environmental context, and commonsense reasoning. It systematically evaluates large vision-language models (VLMs) for fine-grained contextual emotion recognition under zero-shot and few-shot fine-tuning settings, introducing a dual-path multimodal architecture: (i) end-to-end VLM-based joint reasoning and (ii) a two-stage pipeline comprising image captioning followed by pure language-model inference. Contribution/Results: On the EMOTIC benchmark, fine-tuning only a small-scale VLM surpasses state-of-the-art unimodal and multimodal baselines. The method establishes a novel paradigm for embodied agents to achieve robust, context-sensitive affective perception and interaction.

πŸ“ Abstract
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real-world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory-of-mind task. In this paper, we examine two major approaches enabled by recent large vision-language models: 1) image captioning followed by a language-only LLM, and 2) vision-language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision-language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
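The first approach described in the abstract, captioning followed by language-only reasoning, can be sketched as a simple two-stage pipeline. This is an illustrative sketch only: the function names (`two_stage_emotion`, `fake_captioner`, `fake_llm`) and the stubbed model calls are hypothetical stand-ins, not the paper's implementation; a real system would plug in a VLM captioner and an LLM.

```python
# Hypothetical sketch of a two-stage contextual emotion pipeline:
# Stage 1 turns the image into a text description; Stage 2 reasons
# over that text alone with a language-only model. Both model calls
# are injected as functions so the structure is clear without real APIs.

from typing import Callable, List

def two_stage_emotion(image: object,
                      caption_fn: Callable[[object], str],
                      llm_fn: Callable[[str], List[str]]) -> List[str]:
    """Caption the scene, then ask a language-only model for emotion labels."""
    caption = caption_fn(image)  # Stage 1: image -> text
    prompt = (
        "Scene description: " + caption + "\n"
        "Which apparent emotions does the person in the bounding box show?"
    )
    return llm_fn(prompt)        # Stage 2: text -> emotion labels

# Stub stand-ins for illustration only; a real VLM/LLM would replace these.
def fake_captioner(image: object) -> str:
    return "A person slumped on a park bench, head in hands, on a rainy day."

def fake_llm(prompt: str) -> List[str]:
    # A real LLM would reason over the description; here we return fixed labels.
    return ["sadness", "fatigue"]

labels = two_stage_emotion(None, fake_captioner, fake_llm)
print(labels)  # -> ['sadness', 'fatigue']
```

The alternative path evaluated in the paper feeds the image to a vision-language model directly, collapsing the two stages into one joint reasoning step.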
Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition
Computer Vision
Artificial Intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion Recognition
Few-shot Learning
AI Assistants
Yasaman Etesam
Simon Fraser University, BC, Canada
Γ–zge Nilay YalΓ§in
Simon Fraser University, BC, Canada
Chuxuan Zhang
Simon Fraser University, BC, Canada
Angelica Lim
Simon Fraser University, BC, Canada