🤖 AI Summary
Audio-visual emotion recognition (AVER) in natural scenes suffers from weak generalization and high pretraining costs, driven by pose variation, occlusion, and background noise. To address this, the paper proposes a parameter-efficient, modular framework for adapting language-supervised foundation models. Methodologically, it freezes the CLIP/CLAP backbones and applies LoRA fine-tuning (updating ≤4.0% of parameters); it pairs ViT-L/14 with a lightweight Transformer to model dynamic facial expressions on the visual side and uses mean pooling for robust speech representations on the audio side; and it introduces asymmetric temporal modeling with a simple fusion head for efficient cross-modal interaction. Evaluated on DFEW and MAFW, the method achieves 80.14% and 61.18% weighted average recall, respectively, setting a new state of the art with only 8M trainable parameters. These results validate the joint design of preserving semantic priors, applying low-rank adaptation, and maintaining computational efficiency.
📝 Abstract
Audiovisual emotion recognition (AVER) in the wild is still hindered by pose variation, occlusion, and background noise. Prevailing methods rely primarily on large-scale domain-specific pre-training, which is costly and often mismatched to real-world affective data. To address this, we present CLAIP-Emo, a modular framework that reframes in-the-wild AVER as parameter-efficient adaptation of language-supervised foundation models (CLIP/CLAP). Specifically, it (i) preserves language-supervised priors by freezing the CLIP/CLAP backbones and performing emotion-oriented adaptation via LoRA (updating ≤4.0% of the total parameters), (ii) allocates temporal modeling asymmetrically, employing a lightweight Transformer for visual dynamics while applying mean pooling for audio prosody, and (iii) applies a simple fusion head for prediction. On DFEW and MAFW, CLAIP-Emo (ViT-L/14) achieves 80.14% and 61.18% weighted average recall with only 8M trainable parameters, setting a new state of the art. Our findings suggest that parameter-efficient adaptation of language-supervised foundation models provides a scalable alternative to domain-specific pre-training for real-world AVER. The code and models will be available at https://github.com/MSA-LMC/CLAIP-Emo.
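The low-rank adaptation idea behind the ≤4.0% trainable-parameter budget can be illustrated with a minimal numpy sketch. The dimensions, rank, and scaling factor below are illustrative assumptions, not the paper's actual configuration: a frozen backbone weight is augmented with a trainable low-rank update, so only the small factors are optimized.

```python
import numpy as np

# Hypothetical sizes: a single d_model x d_model projection inside a frozen
# CLIP/CLAP backbone, adapted with LoRA of rank `rank`. These numbers are
# illustrative, not taken from the paper.
d_model, rank, alpha = 1024, 16, 32

rng = np.random.default_rng(0)
W_frozen = rng.standard_normal((d_model, d_model))  # frozen backbone weight
A = rng.standard_normal((rank, d_model)) * 0.01     # trainable down-projection
B = np.zeros((d_model, rank))                       # trainable up-projection (zero init)

def lora_forward(x):
    # y = x W^T + (alpha/rank) * x A^T B^T : the frozen path plus the
    # low-rank trained path. With B initialized to zero, the adapted layer
    # starts out identical to the frozen backbone.
    return x @ W_frozen.T + (alpha / rank) * (x @ A.T) @ B.T

trainable = A.size + B.size
total = W_frozen.size + trainable
print(f"trainable fraction: {100 * trainable / total:.2f}%")  # well under 4.0%
```

With these toy dimensions, only about 3% of the layer's parameters are trainable, which mirrors why LoRA keeps the overall budget small while the language-supervised priors in the frozen weights stay intact.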