Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness

📅 2025-01-14
🤖 AI Summary
Existing video multimodal large language models (Video MLLMs) struggle with fine-grained dynamic facial expression understanding due to insufficient training data and coarse-grained visual modeling. Method: We introduce FaceTrack-MM, a novel model featuring a lightweight, principal-face-focused visual encoder for precise facial trajectory modeling; design a multi-dimensional consistency metric integrating event extraction, relation classification, and longest-common-subsequence (LCS)-based temporal alignment; and construct the first instruction-tuning dataset for facial expression understanding (5,033 videos, >700K tokens). Contribution/Results: We release FEC-Bench—the first dedicated benchmark for facial expression comprehension—and open-source all data, code, and evaluation tools. Experiments show FaceTrack-MM significantly outperforms state-of-the-art Video MLLMs on FEC-Bench, enabling fine-grained, temporally consistent recognition and description of principal subjects’ expressions in multi-person scenarios.

📝 Abstract
Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. The dataset comprises 5,033 manually annotated, high-quality video clips containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which uses a limited number of tokens to encode the main character's face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs on this specific task. All data and source code will be made publicly available.
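The temporal-alignment part of the proposed metric can be illustrated with a small sketch. Assuming (hypothetically; the paper's actual pipeline may differ) that event extraction and relation classification have already reduced each caption to an ordered list of expression-event labels, an LCS score then measures how many reference events the generated caption reproduces in the correct order:

```python
# Hypothetical sketch of the LCS-based temporal-consistency score described
# in the abstract. The event labels and normalization choice are illustrative
# assumptions, not the paper's exact formulation.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two event sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def temporal_consistency(pred_events: list[str], ref_events: list[str]) -> float:
    """Fraction of reference events reproduced in the correct temporal order."""
    if not ref_events:
        return 0.0
    return lcs_length(pred_events, ref_events) / len(ref_events)

ref = ["neutral", "smile", "laugh", "neutral"]
pred = ["smile", "laugh", "frown", "neutral"]
print(temporal_consistency(pred, ref))  # 3 of 4 reference events in order -> 0.75
```

Unlike set-overlap scores, the LCS formulation penalizes captions that mention the right expressions in the wrong order, which matches the metric's stated goal of temporal sequence consistency.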
Problem

Research questions and friction points this paper is trying to address.

Multimodal Video Language Models
Insufficient Training Data
Limited Visual Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Facial Expression Dataset
FaceTrack-MM Model
FEC-Bench Test
Authors
Jiaxin Zhao — Tongyi Group, Alibaba
Boyuan Sun — Nankai University (Computer Vision, Multi-Modal Large Language Model, Semantic Segmentation)
Xiang Chen — Tongyi Group, Alibaba
Xihan Wei — Tongyi Group, Alibaba