ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

📅 2025-03-27
🤖 AI Summary
Existing methods struggle to synthesize natural, synchronized head, torso, and hand motions in real-time video chat, while lacking fine-grained control over speaking style and facial micro-expressions. This paper introduces the first end-to-end, style-controllable framework for real-time upper-body portrait generation, supporting interactive 30 fps synthesis at 512×768 resolution. Our approach features: (1) a hierarchical motion diffusion model that unifies explicit and implicit audio-driven motion generation; (2) audio-conditioned latent-space modeling coupled with explicit gesture signal injection; and (3) a two-stage generation pipeline: motion prediction followed by facial refinement. Implemented on an RTX 4090 GPU, the system achieves real-time inference, significantly improving motion coherence, expressive richness, and stylistic controllability. Experimental results demonstrate high-fidelity, natural upper-body video interaction with strong temporal consistency and semantic alignment to speech.

๐Ÿ“ Abstract
Real-time interactive video-chat portraits are increasingly recognized as a future trend, particularly given the remarkable progress in text and voice chat technologies. However, existing methods focus primarily on real-time generation of head movements and struggle to produce body motions synchronized with those head actions. Achieving fine-grained control over speaking style and the nuances of facial expression also remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of two stages. The first stage uses efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio input, generating a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage generates portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Our approach supports efficient, continuous generation of upper-body portrait video at up to 512×768 resolution and up to 30 fps on an RTX 4090 GPU, enabling real-time interactive video chat. Experimental results demonstrate that our approach produces portrait videos with rich expressiveness and natural upper-body movements.
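The abstract's two-stage flow can be sketched as a minimal inference loop. This is a toy illustration only: every class, function, and dimension below is a hypothetical placeholder, since the listing does not describe the model internals.

```python
# Hedged sketch of the two-stage pipeline from the abstract.
# Stage 1: audio + style -> hierarchical motion latents (toy denoising loop
# stands in for the diffusion model). Stage 2: motion + explicit hand signal
# -> upper-body frames, then a face-refinement pass.
# All names, shapes, and operations are illustrative assumptions.
import numpy as np

FPS = 30
AUDIO_DIM, MOTION_DIM = 128, 64
H, W = 768, 512  # max reported output resolution (512x768)

def hierarchical_motion_diffusion(audio_feats, style, steps=4):
    """Stage 1 (assumed interface): generate motion latents from audio,
    conditioned on a style control, for synchronized head/body motion."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal((len(audio_feats), MOTION_DIM))
    proj = np.ones((AUDIO_DIM, MOTION_DIM)) * 0.01  # placeholder audio projection
    for _ in range(steps):  # toy iterative refinement, not real diffusion sampling
        x = 0.5 * x + 0.5 * np.tanh(audio_feats @ proj + style)
    return x

def render_upper_body(motion, hand_signal):
    """Stage 2 (assumed interface): motion latents plus an explicit hand
    control signal -> raw video frames."""
    frames = np.zeros((len(motion), H, W, 3), dtype=np.float32)
    frames += motion.mean(axis=1)[:, None, None, None]          # placeholder "rendering"
    frames[:, H // 2:, :, :] += hand_signal[:, None, None, None]  # inject hand signal
    return frames

def refine_faces(frames):
    """Face-refinement pass to enhance realism (assumed placeholder)."""
    return np.clip(frames, 0.0, 1.0)

# one second of (dummy) audio features at 30 fps
audio = np.zeros((FPS, AUDIO_DIM))
style = 0.2                      # scalar stand-in for the style control
hands = np.linspace(0, 1, FPS)   # stand-in explicit hand gesture signal

motion = hierarchical_motion_diffusion(audio, style)
video = refine_faces(render_upper_body(motion, hands))
print(video.shape)  # (30, 768, 512, 3)
```

The point of the sketch is the data flow, not the models: stage 1 turns audio into compact motion latents cheaply, so the expensive stage 2 renderer only runs once per frame, which is what makes the 30 fps budget plausible.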
Problem

Research questions and friction points this paper is trying to address.

Synchronize body motions with head movements in real-time video
Achieve fine-grained control over speaking style and facial expressions
Generate expressive upper-body portrait videos with hand gestures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical motion diffusion for synchronized expressions
Explicit hand control for detailed gestures
Real-time 30fps upper-body video generation
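The real-time claim implies a hard per-frame latency budget; the arithmetic below is straightforward and not taken from the paper.

```python
# Per-frame latency budget implied by the reported 30 fps real-time target,
# and the pixel count at the maximum reported resolution.
fps = 30
frame_budget_ms = 1000 / fps      # total wall-clock time available per frame
pixels = 512 * 768                # max reported output resolution

print(round(frame_budget_ms, 2))  # 33.33
print(pixels)                     # 393216
```

In other words, both stages (motion diffusion plus rendering and face refinement) must complete in about 33 ms per frame on the RTX 4090 to sustain interactive chat.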
Authors
Jinwei Qi (Tongyi Lab, Alibaba Group)
Chaonan Ji (Tongyi Lab, Alibaba Group)
Sheng Xu (Tongyi Lab, Alibaba Group)
Peng Zhang (Tongyi Lab, Alibaba Group)
Bang Zhang (Tongyi Lab, Alibaba Group)
Liefeng Bo (Head of Applied Computer Vision Lab at Alibaba Group)