FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in emotional video captioning that the relative importance of factual accuracy and emotional expression varies across samples, a nuance existing methods struggle to capture. To this end, the authors propose FACE-net, a retrieval-augmented framework that jointly models factual and emotional semantics within a unified architecture. The approach features a dynamic bias-adjustment routing mechanism that adaptively guides the generation process according to each sample’s fact–emotion preference. Key innovations include semantic enrichment from external corpora, uncertainty-aware calibration of factual triplets, expert-guided emotional query generation, and interaction with an emotion lexicon. Experimental results demonstrate that FACE-net significantly improves the overall performance of generated captions in both factual correctness and emotional appropriateness.
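The retrieval step described above — fetching the external-corpus sentences most relevant to the video content — can be sketched as a simple cosine-similarity lookup. This is an illustrative assumption, not the paper's implementation: the actual repository, encoders, and similarity function are not specified here, and `retrieve_top_k`, the toy embeddings, and the dimensionality are all hypothetical.

```python
import numpy as np

def retrieve_top_k(video_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k corpus sentences most similar to the video embedding."""
    # Normalize so the dot product equals cosine similarity.
    v = video_emb / (np.linalg.norm(video_emb) + 1e-8)
    c = corpus_embs / (np.linalg.norm(corpus_embs, axis=1, keepdims=True) + 1e-8)
    sims = c @ v
    # Sort by descending similarity and keep the top k indices.
    return np.argsort(-sims)[:k]

# Toy example: four corpus sentences in a 3-d embedding space.
video = np.array([1.0, 0.0, 0.0])
corpus = np.array([
    [0.9, 0.1, 0.0],    # highly similar to the video
    [0.0, 1.0, 0.0],    # orthogonal
    [0.7, 0.7, 0.0],    # partially similar
    [-1.0, 0.0, 0.0],   # opposite
])
print(retrieve_top_k(video, corpus, k=2))  # -> [0 2]
```

In a real system the retrieved sentences would then be decomposed into subject-predicate-object triplets for the calibration stage.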

📝 Abstract
Emotional Video Captioning (EVC) is an emerging task that aims to describe the factual content of a video together with the intrinsic emotions it expresses. Existing works perceive global emotional cues and combine them with video content to generate descriptions. However, insufficient mining and coordination of factual and emotional cues during generation leave these methods ill-equipped to handle the factual-emotional bias, i.e., the fact that different samples place different factual and emotional requirements on generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which collaboratively mines factual-emotional semantics within a unified architecture and provides adaptive, accurate guidance for generation, breaking through the compromising tendency of factual-emotional descriptions when learning over all samples. Technically, we first introduce an external repository and retrieve the sentences most relevant to the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, then self-refines and cross-refines the different components against the video content to effectively mine factual semantics; meanwhile, our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and an emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment each factual semantic with emotion. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of each sample.
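The dynamic bias adjustment routing module in the abstract can be pictured as a per-sample gate that predicts a fact-versus-emotion preference and blends the two feature streams accordingly. The sketch below is a minimal assumption of such a mechanism: the projection `W`, the `route` function, and the random toy features are hypothetical, and in the actual framework the gate would be trained end-to-end with the captioning objective.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def route(sample_feat, fact_feat, emo_feat, W):
    """Predict a 2-way fact/emotion preference and blend the two streams."""
    # W is a learned (2, d) projection; weights = [w_fact, w_emotion].
    weights = softmax(W @ sample_feat)
    # Convex combination: samples with a factual bias weight fact_feat more.
    fused = weights[0] * fact_feat + weights[1] * emo_feat
    return weights, fused

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(2, d))
sample = rng.normal(size=d)
fact, emo = rng.normal(size=d), rng.normal(size=d)
w, fused = route(sample, fact, emo, W)
print(w)  # two non-negative weights summing to 1
```

Because the gate is conditioned on the sample itself, the blend varies per video rather than being a single global trade-off.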
Problem

Research questions and friction points this paper is trying to address.

Emotional Video Captioning
factual-emotional bias
factual calibration
emotion augmentation
retrieval-enhanced generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-enhanced
Factual Calibration
Emotion Augmentation
Uncertainty Estimation
Dynamic Bias Adjustment
👥 Authors

Weidong Chen
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China

Cheng Ye
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China

Zhendong Mao
University of Science and Technology of China
CV, NLP

Peipei Song
University of Science and Technology of China
Multimedia, Computer Vision, Machine Learning

Xinyan Liu
School of Computer Science and Technology, Harbin Institute of Technology (Weihai), Weihai, China

Lei Zhang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China

Xiaojun Chang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China

Yongdong Zhang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China; and Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230027, China