Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

📅 2026-03-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of generating 3D facial expressions that align with human preferences, convey appropriate emotions, and suit natural dyadic interactions. The authors propose an identity-agnostic facial expression generation framework that formulates expression synthesis as a motor learning problem. By jointly modeling visual, linguistic, and motor signals, the approach dynamically responds to a conversational partner's multimodal cues. The method integrates imitation learning with reinforcement learning from human feedback to establish a closed-loop optimization mechanism, leveraging a 3D morphable model for context-aware expression generation. Evaluated on two benchmark datasets, the proposed approach significantly outperforms existing methods, producing expressions that are consistently preferred by human evaluators and demonstrate superior emotional expressiveness and alignment with social interaction dynamics.

πŸ“ Abstract
Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference, leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to the speaker's evolving conversational cues. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression responses with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.
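The abstract describes a training objective that combines imitating high-quality expression responses with critic-guided optimization over low-dimensional 3D morphable model (3DMM) coefficients. The paper does not give implementation details, so the following is only a minimal numerical sketch of such a combined objective: the coefficient dimension, the `beta` weight, the scalar critic score, and all function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def combined_objective(pred, expert, critic_score, beta=0.5):
    """Toy sketch of an imitation + critic-guided loss (illustrative only).

    pred, expert : arrays of 3DMM expression coefficients
                   (predicted vs. high-quality reference response).
    critic_score : hypothetical human-preference score in [0, 1],
                   higher meaning more preferred.
    beta         : weight trading off imitation against preference.
    """
    # Supervised imitation term: match the reference expression coefficients.
    imitation = np.mean((pred - expert) ** 2)
    # Critic-guided term: convert the preference score into a loss.
    preference = 1.0 - critic_score
    return imitation + beta * preference

# Hypothetical 52-dim blendshape-style coefficient vector.
rng = np.random.default_rng(0)
expert = rng.normal(size=52)
close_pred = expert + 0.01   # near the reference, high critic score
far_pred = expert + 0.5      # off the reference, low critic score

loss_close = combined_objective(close_pred, expert, critic_score=0.9)
loss_far = combined_objective(far_pred, expert, critic_score=0.2)
```

Under this sketch, a prediction that both tracks the reference response and is preferred by the critic receives a lower loss (`loss_close < loss_far`); the closed feedback loop in the paper would then update the policy against this kind of signal.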
Problem

Research questions and friction points this paper is trying to address.

facial expression generation
human preference
dyadic interaction
emotional appropriateness
social alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

human preference alignment
facial expression generation
reinforcement learning from human feedback
vision-language-action model
dyadic interaction
🔎 Similar Papers
No similar papers found.