Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key challenges in real-time audio-driven portrait animation: difficulty in long-term temporal modeling and inconsistent multi-region motion. To this end, we propose the first autoregressive framework specifically designed for streaming speech. Methodologically: (1) we design an audio-stream-driven autoregressive facial motion token generation mechanism; (2) we introduce implicit keypoint modeling coupled with an Efficient Temporal Module (ETM) to explicitly capture subtle physical motions—such as neck muscle deformation and earring oscillation; and (3) we integrate Residual Vector Quantization (Residual VQ) with a lightweight Transformer architecture to enhance computational efficiency. Experiments demonstrate real-time generation at 25 FPS, with inference latency of only 0.92 seconds per second of video—22× faster than diffusion-based methods. Human evaluation confirms significant improvements in fine-motion fidelity and overall visual realism.
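As a concrete picture of the Residual VQ step named above: each stage quantizes whatever residual the previous codebook left behind, so a short stack of small codebooks can represent a continuous motion latent as a handful of discrete tokens. A minimal NumPy sketch; the codebook count, codebook size, and latent dimension below are illustrative, not the paper's configuration.

```python
import numpy as np

def residual_vq_encode(latent, codebooks):
    """Quantize a facial-motion latent with residual VQ.

    latent:     (d,) continuous motion latent, e.g. from an implicit
                keypoint encoder (hypothetical shape).
    codebooks:  list of (K, d) arrays; each stage quantizes the
                residual left by the previous stage.
    Returns per-stage token indices and the reconstruction.
    """
    residual = latent.copy()
    tokens, recon = [], np.zeros_like(latent)
    for cb in codebooks:
        # Nearest codeword to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return tokens, recon

# Toy usage: 3 stages, 256 codewords each, 64-dim latent.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]
latent = rng.normal(size=64)
tokens, recon = residual_vq_encode(latent, codebooks)
print(tokens, np.linalg.norm(latent - recon))
```

Each added stage shrinks the reconstruction error, which is why a few small codebooks can stand in for one very large one at much lower lookup cost.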


📝 Abstract
In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head generation. Beyond the challenge of lengthy animation times, a critical difficulty in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced together with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates the ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of their movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo: 20.93 s vs. Teller: 0.92 s to generate one second of video), and achieves real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially on small movements, as validated by human evaluations showing a significant margin in quality and realism.
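To make the streaming claim concrete: once motion is tokenized, generation can proceed chunk by chunk, decoding motion tokens for each incoming audio slice instead of waiting for the whole utterance. The sketch below is not the authors' code; `ar_step`, the chunk length, and the bounded history window are all illustrative stand-ins for the AR transformer's decode step.

```python
from collections import deque

FPS = 25          # target streaming frame rate from the paper
CHUNK_FRAMES = 5  # hypothetical slice length (0.2 s of audio at 25 FPS)

def stream_motion_tokens(audio_chunks, ar_step, history_len=50):
    """Yield motion tokens as audio arrives, one temporal slice at a time.

    audio_chunks: iterable of audio-embedding slices (e.g. from a
                  streaming speech encoder), each covering CHUNK_FRAMES frames.
    ar_step:      callable(history, audio_slice) -> list of motion tokens;
                  stands in for the autoregressive transformer's decode step.
    """
    history = deque(maxlen=history_len)  # bounded context keeps latency flat
    for audio_slice in audio_chunks:
        tokens = ar_step(list(history), audio_slice)
        history.extend(tokens)
        # Tokens can be decoded to frames immediately, so playback
        # starts after the first chunk rather than after the full clip.
        yield tokens

# Toy usage with a stub decode step (real model omitted).
dummy = lambda hist, a: [len(hist) % 256] * CHUNK_FRAMES
for out in stream_motion_tokens([0.0, 0.0, 0.0], dummy):
    print(out)
```

The bounded history is the key latency lever: per-chunk decode cost stays constant no matter how long the stream runs, which is what lets throughput hold at the target frame rate.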
Problem

Research questions and friction points this paper is trying to address.

Real-time audio-driven portrait animation generation
Preserving natural movement of diverse body parts
Efficient streaming performance with autoregressive motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive motion generation for real-time animation
Facial Motion Latent Generation with Residual VQ
Efficient Temporal Module refines movement authenticity
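The ETM's internals are not spelled out on this page; one plausible shape for a lightweight temporal refiner over per-frame keypoint features is a depthwise temporal convolution with a residual connection, sketched below in PyTorch. Treat every design choice here as an assumption rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class EfficientTemporalModule(nn.Module):
    """Guessed shape of a lightweight temporal refiner: a depthwise
    temporal convolution plus channel mixing, with a residual connection
    over per-frame keypoint features. The paper's ETM may differ."""

    def __init__(self, channels: int, kernel: int = 3):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, kernel,
                                  padding=kernel // 2, groups=channels)
        self.mix = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) sequence of keypoint features
        h = x.transpose(1, 2)              # -> (batch, channels, time)
        h = self.mix(torch.relu(self.temporal(h)))
        return x + h.transpose(1, 2)       # residual keeps per-frame content

feats = torch.randn(1, 25, 64)             # one second at 25 FPS, 64-dim
print(EfficientTemporalModule(64)(feats).shape)  # torch.Size([1, 25, 64])
```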
Dingcheng Zhen
SoulApp.com
LLM · Computer vision · Multi-modal · AIGC
Shunshun Yin
Shanghai Soulgate Technology Co., Ltd.
Shiyang Qin
Shanghai Soulgate Technology Co., Ltd.
Hou Yi
Shanghai Soulgate Technology Co., Ltd.
Ziwei Zhang
Shanghai Soulgate Technology Co., Ltd.
Siyuan Liu
Shanghai Soulgate Technology Co., Ltd.
Gan Qi
Shanghai Soulgate Technology Co., Ltd.
Ming Tao
Shanghai Soulgate Technology Co., Ltd.