Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions

📅 2025-06-12

🤖 AI Summary

This study addresses two weaknesses of conventional supervised facial affect recognition models in online learning scenarios: poor generalizability and heavy reliance on labeled data. It investigates two open-source vision-language models (VLMs), Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, for zero-shot recognition of students' academic emotions from facial expressions. Evaluated on 5,000 images spanning five categories (confused, distracted, happy, neutral, and tired), both models show moderate performance, with Qwen2.5-VL-7B-Instruct performing better overall. Both models recognize happy expressions well but fail to detect distraction, and Qwen2.5-VL-7B-Instruct is relatively strong at recognizing confusion. These preliminary results suggest that off-the-shelf VLMs, used without fine-tuning, are a feasible starting point for analyzing academic emotions when labeled data is scarce.

📝 Abstract

Students' academic emotions significantly influence their social behavior and learning performance. Traditional approaches to analyzing these emotions automatically and accurately have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students' academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel at identifying students' happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students' confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.

Problem

Research questions and friction points this paper is trying to address.

Detect students' academic emotions via facial expressions

Overcome generalization issues in traditional emotion analysis methods

Evaluate VLMs' zero-shot performance in online learning environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models for emotion detection

Employs zero-shot prompting for generalization

Analyzes facial expressions in online learning
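
To make the zero-shot setup concrete, here is a minimal Python sketch of the pieces around the model call: building a prompt over the five categories, mapping a free-text reply back to a label, and scoring per-category recall. The prompt wording, the function names, and the `ask_model` callable are assumptions for illustration only; the paper's actual prompt and evaluation code are not given here, and no real VLM API is invoked.

```python
import re
from collections import Counter
from typing import Callable, Optional, Sequence

# The five categories named in the abstract.
LABELS = ("confused", "distracted", "happy", "neutral", "tired")


def build_prompt(labels: Sequence[str] = LABELS) -> str:
    """Hypothetical zero-shot prompt; not the wording used in the paper."""
    options = ", ".join(labels)
    return (
        "Look at the student's face in this image. "
        f"Which one of these states best describes it: {options}? "
        "Answer with a single word."
    )


def parse_label(response: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Return the label mentioned earliest in the reply, or None if none appears."""
    text = response.lower()
    hits = [
        (match.start(), label)
        for label in labels
        if (match := re.search(rf"\b{re.escape(label)}\b", text))
    ]
    return min(hits)[1] if hits else None


def classify(image_path: str, ask_model: Callable[[str, str], str]) -> Optional[str]:
    """ask_model(image_path, prompt) -> reply text; a placeholder for any VLM client."""
    return parse_label(ask_model(image_path, build_prompt()))


def per_class_recall(y_true, y_pred, labels: Sequence[str] = LABELS) -> dict:
    """Fraction of images of each true category that were labeled correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in labels if total[label]}
```

Taking the earliest mentioned label is a simple heuristic that misreads replies such as "not happy, rather tired"; constraining the model to a one-word answer, as the prompt requests, reduces but does not remove that risk. Reporting recall per category rather than a single accuracy figure is what makes a pattern such as strong results on happy and failure on distracted visible.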
