🤖 AI Summary
This work addresses the challenge of predicting dynamic affective states, specifically changes in valence and arousal, in unconstrained environments, with an approach that aims to balance interpretability and performance. The method starts from handcrafted, domain-knowledge-driven features of facial geometry and acoustic signals and transforms them into natural-language descriptions. A pretrained language model then processes these descriptions to generate semantic context embeddings that serve as high-level priors over affective dynamics. Using the language model as a semantic conditioner over transparent features, rather than as part of an end-to-end black-box pipeline, keeps the input features interpretable. Evaluated on the Aff-Wild2 and SEWA datasets, the approach shows consistent accuracy improvements for both valence and arousal over baselines that rely solely on handcrafted features or deep embeddings.
📝 Abstract
Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI. While deep neural embeddings dominate contemporary approaches, they often lack interpretability and limit expert-driven refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our approach begins with interpretable facial geometry and acoustic features derived from structured domain knowledge. These features are transformed into symbolic natural-language descriptions encoding their affective implications. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while leveraging the contextual abstraction capabilities of LMs. We evaluate the proposed method on the Aff-Wild2 and SEWA datasets for affect change prediction. Experimental results show consistent improvements in accuracy for both Valence and Arousal compared to handcrafted-only and deep-embedding baselines. Our findings demonstrate that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures.
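The pipeline described above (handcrafted descriptors, then symbolic descriptions, then context embeddings that condition the features) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature names, the ±0.5 thresholds, the `describe`, `embed_text`, and `condition` helpers, and the FiLM-style scale-and-shift used for conditioning. In particular, `embed_text` is a hashed bag-of-words stand-in so the example stays self-contained; the actual framework uses a pretrained LM at that step.

```python
import hashlib
import math


def describe(features: dict) -> str:
    """Turn handcrafted descriptors into a short symbolic description.

    Feature names and the +/-0.5 thresholds are hypothetical.
    """
    phrases = []
    for name, value in sorted(features.items()):
        if value > 0.5:
            level = "increased"
        elif value < -0.5:
            level = "decreased"
        else:
            level = "stable"
        phrases.append(f"{name} is {level}")
    return "; ".join(phrases)


def embed_text(text: str, dim: int = 16) -> list:
    """Stand-in for a pretrained LM encoder (assumption): hashed bag-of-words.

    Each token is hashed into one of `dim` buckets; the count vector is
    then L2-normalised.
    """
    vec = [0.0] * dim
    for token in text.replace(";", " ").split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def condition(x: list, context: list, w_scale: list, w_shift: list) -> list:
    """FiLM-style conditioning (assumed form, not the paper's exact layer).

    w_scale and w_shift hold one weight row per feature, each row as long
    as `context`. The context produces a per-feature scale and shift.
    """
    out = []
    for xi, ws, wb in zip(x, w_scale, w_shift):
        scale = 1.0 + math.tanh(sum(c * w for c, w in zip(context, ws)))
        shift = sum(c * w for c, w in zip(context, wb))
        out.append(scale * xi + shift)
    return out


# Hypothetical per-window descriptors (e.g. z-scored changes).
features = {"brow_raise": 0.8, "lip_corner_pull": -0.1, "pitch_change": -0.9}
text = describe(features)
# text == "brow_raise is increased; lip_corner_pull is stable; pitch_change is decreased"
context = embed_text(text)
x = [features[name] for name in sorted(features)]
# Untrained placeholder weights; in practice these would be learned.
w_scale = [[0.01] * len(context) for _ in x]
w_shift = [[0.01] * len(context) for _ in x]
conditioned = condition(x, context, w_scale, w_shift)
```

In this sketch the handcrafted values in `x` remain visible and are only rescaled and shifted by the context, which is one way to read the abstract's claim that feature transparency is preserved; the paper's actual conditioning mechanism may differ.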