🤖 AI Summary
Video engagement recognition is severely hindered by subjective label noise. To address this, we propose a vision-language model (VLM)-based label optimization framework. First, VLMs perform semantic consistency verification and refinement of the original annotations; concurrently, behavioral-cue questionnaires are used to establish a sample reliability hierarchy. Second, we design a reliability-aware curriculum learning strategy that dynamically introduces high-confidence samples while incorporating low-reliability ones via soft-label distillation. Crucially, our approach requires no additional human annotation, effectively mitigating subjective noise. Evaluated on three major benchmarks—EngageNet, DREAMS, and PAFE—our method achieves up to a 1.21% improvement in F1 score over state-of-the-art methods, demonstrating superior robustness and generalization.
📝 Abstract
Engagement recognition in video, unlike traditional image classification, is particularly hindered by subjective and noisy labels that limit model performance. To address this, we propose a framework that leverages Vision Large Language Models (VLMs) to refine annotations and guide training. Our framework uses a behavioral-cue questionnaire to split the data into high- and low-reliability subsets. We further introduce a training strategy that combines curriculum learning with soft-label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect their uncertainty. We show that classical computer vision models trained on the refined high-reliability subsets and enhanced with our curriculum strategy improve consistently, highlighting the benefit of addressing label subjectivity with VLMs. Our method surpasses the prior state of the art across engagement benchmarks including EngageNet (three of six feature settings, maximum improvement of +1.21%), as well as DREAMS and PAFE, with F1 gains of +0.22 and +0.06, respectively.
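The reliability-aware curriculum described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the linear ramp schedule, the blending weight `alpha`, and the name `vlm_probs` (a VLM-derived class distribution for each ambiguous sample) are all assumptions made for the example.

```python
# Hypothetical sketch of a reliability-aware curriculum with soft-label
# distillation: high-reliability samples train with hard labels from the
# start; low-reliability samples are phased in with softened targets.

def soft_label(hard_label, vlm_probs, alpha):
    """Blend a one-hot label with a VLM-predicted class distribution.

    alpha=1.0 -> trust the refined hard label fully;
    alpha=0.0 -> trust the VLM distribution fully.
    """
    n = len(vlm_probs)
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(n)]
    return [alpha * h + (1.0 - alpha) * p for h, p in zip(one_hot, vlm_probs)]

def curriculum_epoch(high, low, epoch, total_epochs, num_classes, alpha=0.7):
    """Return (features, target-distribution) pairs for one epoch.

    `high`: list of (features, label) samples judged reliable.
    `low` : list of (features, label, vlm_probs) samples judged ambiguous.
    Low-reliability samples enter linearly over the first half of training
    (an illustrative schedule, not the paper's).
    """
    frac = min(1.0, epoch / max(1, total_epochs // 2))
    k = int(frac * len(low))
    data = [(x, [1.0 if i == y else 0.0 for i in range(num_classes)])
            for x, y in high]
    data += [(x, soft_label(y, p, alpha)) for x, y, p in low[:k]]
    return data
```

At epoch 0 only the reliable subset is used; by mid-training all ambiguous samples participate, each supervised by a distribution that keeps probability mass on the VLM's alternative classes rather than a brittle one-hot target.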