Contextual Emotion Recognition using Large Vision Language Models

πŸ“… 2024-05-14
πŸ›οΈ IEEE/RJS International Conference on Intelligent RObots and Systems
πŸ“ˆ Citations: 2
✨ Influential: 0
πŸ€– AI Summary
Existing facial-expression-only approaches to human emotion recognition in real-world scenarios suffer from poor generalization due to insufficient contextual grounding. Method: This paper proposes a contextualized emotion understanding framework that jointly models body pose, environmental context, and commonsense reasoning. It systematically evaluates large vision-language models (VLMs) for fine-grained contextual emotion recognition under zero-shot and few-shot fine-tuning settings, introducing a dual-path multimodal architecture: (i) end-to-end VLM-based joint reasoning and (ii) a two-stage pipeline comprising image captioning followed by pure language-model inference. Contribution/Results: On the EMOTIC benchmark, fine-tuning only a small-scale VLM surpasses state-of-the-art unimodal and multimodal baselines. The method establishes a novel paradigm for embodied agents to achieve robust, context-sensitive affective perception and interaction.

πŸ“ Abstract
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real-world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory-of-mind task. In this paper, we examine two major approaches enabled by recent large vision-language models: 1) image captioning followed by a language-only LLM, and 2) vision-language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision-language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
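The first approach described in the abstract, captioning followed by language-only reasoning, can be sketched as a simple two-stage pipeline. This is an illustrative sketch only: the function names (`two_stage_emotion`, `fake_captioner`, `fake_llm`) and the stubbed model calls are hypothetical stand-ins, not the paper's implementation; a real system would plug in a VLM captioner and an LLM.

```python
# Hypothetical sketch of a two-stage contextual emotion pipeline:
# Stage 1 turns the image into a text description; Stage 2 reasons
# over that text alone with a language-only model. Both model calls
# are injected as functions so the structure is clear without real APIs.

from typing import Callable, List

def two_stage_emotion(image: object,
                      caption_fn: Callable[[object], str],
                      llm_fn: Callable[[str], List[str]]) -> List[str]:
    """Caption the scene, then ask a language-only model for emotion labels."""
    caption = caption_fn(image)  # Stage 1: image -> text
    prompt = (
        "Scene description: " + caption + "\n"
        "Which apparent emotions does the person in the bounding box show?"
    )
    return llm_fn(prompt)        # Stage 2: text -> emotion labels

# Stub stand-ins for illustration only; a real VLM/LLM would replace these.
def fake_captioner(image: object) -> str:
    return "A person slumped on a park bench, head in hands, on a rainy day."

def fake_llm(prompt: str) -> List[str]:
    # A real LLM would reason over the description; here we return fixed labels.
    return ["sadness", "fatigue"]

labels = two_stage_emotion(None, fake_captioner, fake_llm)
print(labels)  # -> ['sadness', 'fatigue']
```

The alternative path evaluated in the paper feeds the image to a vision-language model directly, collapsing the two stages into one joint reasoning step.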
Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition
Computer Vision
Artificial Intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion Recognition
Few-shot Learning
AI Assistants
Yasaman Etesam
Simon Fraser University, BC, Canada
Γ–zge Nilay YalΓ§in
Simon Fraser University, BC, Canada
Chuxuan Zhang
Simon Fraser University, BC, Canada
Angelica Lim
Simon Fraser University, BC, Canada