CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing CLIP-based emotion recognition methods, which struggle to capture individual-specific subtle expressions and rely on computationally expensive language models for prompt generation. To overcome these challenges, the authors propose CLIP-AU, a lightweight temporal framework that, for the first time, directly incorporates interpretable Action Unit (AU) semantics as structured textual prompts into CLIP. Furthermore, they introduce CLIP-AUTT, a test-time personalization approach that leverages entropy-guided temporal window selection and dynamic AU prompt refinement to achieve individual adaptation and temporal consistency without requiring model fine-tuning or auxiliary language models. Evaluated on three fine-grained video emotion datasets—BioVid, StressID, and BAH—the proposed method significantly outperforms current CLIP-based and test-time adaptation approaches, demonstrating superior robustness and personalized recognition capability.
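The summary above describes turning Action Units into structured textual prompts that CLIP can score against video frames. The following is a minimal sketch of that idea, not the paper's implementation: the AU-to-description table is an illustrative subset of FACS, and the frozen CLIP image/text encoders are replaced here by fixed random vectors purely so the scoring step is self-contained.

```python
import numpy as np

# Illustrative subset of FACS Action Unit descriptions (not the paper's exact wording).
AU_DESCRIPTIONS = {
    "AU4": "brow lowerer",
    "AU6": "cheek raiser",
    "AU7": "lid tightener",
    "AU12": "lip corner puller",
}

def build_au_prompt(active_aus):
    """Compose a structured textual prompt from a set of active Action Units."""
    cues = ", ".join(AU_DESCRIPTIONS[au] for au in active_aus)
    return f"a face showing {cues}"

def cosine_similarity(a, b):
    """CLIP-style similarity between an image embedding and a text embedding."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

prompt = build_au_prompt(["AU6", "AU12"])
print(prompt)  # a face showing cheek raiser, lip corner puller

# Stand-ins for CLIP's frozen image and text towers: in the actual method these
# embeddings would come from CLIP; random vectors are used here for illustration.
rng = np.random.default_rng(0)
image_feat = rng.standard_normal(512)
text_feat = rng.standard_normal(512)
score = cosine_similarity(image_feat, text_feat)
```

In the actual framework, one such prompt per emotion class would be encoded by CLIP's text tower, and per-frame image embeddings would be scored against all class prompts.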
📝 Abstract
Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.
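The abstract's entropy-guided temporal window selection can be illustrated with a short sketch: score each frame's class-probability entropy and keep the sliding window where the model is most confident (lowest mean entropy). This is an assumption-laden toy version, not the paper's exact procedure; the window length and the per-frame probabilities are hypothetical.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (low = confident prediction)."""
    return float(-np.sum(p * np.log(p + eps)))

def select_confident_window(frame_probs, window=8):
    """Return the (start, end) of the sliding window with lowest mean entropy.

    frame_probs: (T, C) per-frame class probabilities, e.g. CLIP image-text
    similarity scores passed through a softmax (illustrative setup)."""
    ents = np.array([entropy(p) for p in frame_probs])
    # Mean entropy over each length-`window` sliding window.
    means = np.convolve(ents, np.ones(window) / window, mode="valid")
    start = int(np.argmin(means))
    return start, start + window

# Toy example: 20 frames, 5 classes; frames 10-17 are confident (peaked
# distribution), the rest are uniform (maximum entropy).
T, C = 20, 5
probs = np.full((T, C), 1.0 / C)
probs[10:18] = 0.02
probs[10:18, 0] = 1.0 - 0.02 * (C - 1)
s, e = select_confident_window(probs, window=8)
print(s, e)  # 10 18
```

In the test-time personalization loop, the AU prompts would then be tuned only on frames inside this confident window, which is what ties subject-specific adaptation to temporal consistency.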
Problem

Research questions and friction points this paper is trying to address.

emotion recognition
personalization
Action Units
fine-grained
test-time adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action Units
Vision-Language Models
Test-Time Personalization
Fine-Grained Emotion Recognition
Prompt Tuning