Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos

📅 2026-03-22
🤖 AI Summary
This work addresses the challenge of distribution shifts caused by individual differences in video-based facial expression recognition, as well as the limitations of existing test-time adaptation (TTA) methods, namely their high computational cost and susceptibility to pseudo-label noise. To this end, the authors propose a gradient-free, cache-based TTA approach built on a tri-cache collaboration mechanism, comprising a personalized source cache and positive/negative target caches, together with a tri-gate update strategy. This framework integrates source-domain prototypes with reliable target-domain samples, mitigating pseudo-label noise accumulation while enabling efficient personalization of vision-language models. By incorporating temporal consistency modeling and embedding fusion, the method outperforms current TTA approaches on the BioVid, StressID, and BAH datasets, maintaining high accuracy under both subject- and environment-induced distribution shifts while substantially reducing computational and memory overhead.

📝 Abstract
Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.
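The tri-cache mechanism described in the abstract can be illustrated with a minimal sketch. The paper does not publish pseudocode here, so the class below is an illustrative reconstruction under stated assumptions: embeddings are plain similarity vectors, the source cache holds one prototype per class, the positive cache stores high-confidence target embeddings per class, the negative cache stores low-confidence embeddings tagged with their (likely wrong) pseudo-label so their class score can be penalized, and the tri-gate checks temporal stability, confidence, and consistency with the source prototypes. All names, thresholds, and fusion weights are hypothetical, not the authors' actual design.

```python
import math
from collections import deque

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb + 1e-8)

class TriCacheTTA:
    """Gradient-free tri-cache adaptation sketch (hypothetical parameters)."""

    def __init__(self, source_prototypes, conf_thresh=0.7, cap=8, neg_weight=0.5):
        self.src = source_prototypes                      # personalized source cache: class -> prototype
        self.pos = {c: deque(maxlen=cap) for c in source_prototypes}  # positive target cache
        self.neg = deque(maxlen=cap)                      # negative target cache: (embedding, pseudo-label)
        self.conf_thresh = conf_thresh
        self.neg_weight = neg_weight
        self.prev_label = None                            # for the temporal-stability gate

    def score(self, z):
        """Fuse source-prototype, positive-cache, and negative-cache evidence."""
        scores = {}
        for c, proto in self.src.items():
            s = cos(z, proto)
            if self.pos[c]:                               # attract toward reliable target samples
                s = (s + max(cos(z, m) for m in self.pos[c])) / 2.0
            scores[c] = s
        for m, c_neg in self.neg:                         # repel from known-unreliable samples
            scores[c_neg] -= self.neg_weight * max(0.0, cos(z, m))
        return scores

    def step(self, z):
        """Predict one frame, then apply the tri-gate cache update."""
        scores = self.score(z)
        label = max(scores, key=scores.get)
        conf = scores[label]
        # Tri-gate: temporal stability, confidence, consistency with the source cache.
        stable = self.prev_label is None or label == self.prev_label
        src_label = max(self.src, key=lambda c: cos(z, self.src[c]))
        if conf >= self.conf_thresh and stable and label == src_label:
            self.pos[label].append(z)                     # reliable -> positive cache
        elif conf < self.conf_thresh:
            self.neg.append((z, label))                   # unreliable -> negative cache
        self.prev_label = label
        return label, conf
```

As a usage sketch, feeding frame embeddings in temporal order lets the positive cache personalize the decision boundary to the current subject without any gradient updates, which is the cost advantage the abstract claims over optimization-based TTA.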
Problem

Research questions and friction points this paper is trying to address.

Facial Expression Recognition
Test-Time Adaptation
Distribution Shift
Model Personalization
Video Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Adaptation
Cache Personalization
Vision-Language Models
Facial Expression Recognition
Gradient-Free Adaptation