FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in audio-driven portrait video generation—namely, inaccurate lip synchronization, unnatural motion dynamics, and a misalignment between existing evaluation metrics and human perception—by proposing a multimodal autoregressive generative framework. The approach leverages a large multimodal language model to construct human-aligned evaluation signals and integrates perceptual and temporal consistency regularizers into a composite reward function. The generator is then post-trained with reinforcement learning via Group Relative Policy Optimization. Notably, the authors position this as the first method to explicitly incorporate human preferences into the audiovisual generation optimization pipeline. Experimental results demonstrate significant improvements over state-of-the-art methods across both automatic metrics and human preference evaluations, with generated videos exhibiting markedly enhanced lip-sync accuracy, expressive facial dynamics, and overall motion naturalness.

📝 Abstract
Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
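The abstract describes combining several MLLM-derived quality signals into a composite reward and normalizing it per sampling group, as GRPO does. The sketch below illustrates that core computation: a weighted composite reward followed by the group-relative advantage A_i = (r_i − mean(r)) / std(r) that replaces a learned value critic. The component names (`lip_sync`, `expressiveness`, `motion`, `temporal`) and all weights are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of GRPO's group-relative advantage over a composite reward.
# Reward components and weights are hypothetical; the paper does not publish them.
import statistics


def composite_reward(scores, weights):
    """Weighted sum of per-component reward signals for one sampled video."""
    return sum(weights[k] * scores[k] for k in weights)


def grpo_advantages(rewards):
    """GRPO normalizes each reward against its own sampling group:
    A_i = (r_i - mean(r)) / std(r), so no value critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # Hypothetical weights over the reward components named in the abstract.
    weights = {"lip_sync": 0.4, "expressiveness": 0.3, "motion": 0.2, "temporal": 0.1}
    # MLLM-derived scores for a group of videos sampled from one audio/portrait pair.
    group = [
        {"lip_sync": 0.9, "expressiveness": 0.7, "motion": 0.8, "temporal": 0.9},
        {"lip_sync": 0.5, "expressiveness": 0.6, "motion": 0.7, "temporal": 0.8},
        {"lip_sync": 0.7, "expressiveness": 0.8, "motion": 0.6, "temporal": 0.7},
    ]
    rewards = [composite_reward(s, weights) for s in group]
    print([round(a, 3) for a in grpo_advantages(rewards)])
```

In this scheme, each advantage weights the policy-gradient update for its sampled video, so videos that score above their group's mean on lip sync, expressiveness, and temporal consistency are reinforced relative to their siblings.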
Problem

Research questions and friction points this paper is trying to address.

talking-head video generation
lip synchronization
audio-driven animation
motion naturalness
perceptual evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Audio-Driven Portrait Animation
Multimodal Large Language Models
Lip-Sync Evaluation
Group Relative Policy Optimization