STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

📅 2025-12-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current speech-driven portrait animation methods face two key bottlenecks: (1) 2D approaches rely heavily on strong reference frames, which limits motion diversity; (2) 3D-aware methods depend on inversion through pre-trained triplane generators, leading to reconstruction artifacts and identity drift. This paper proposes a unified identity- and viewpoint-aware generative framework that eliminates both explicit 3D modeling and stringent reference constraints. Key contributions: (1) a soft identity constraint coupled with implicit 3D awareness learned from the multi-view nature of video; (2) temporal-to-spatial adaptive learning, disentangled multi-view modeling, and self-forcing long-context training; and (3) a three-stage disentangled pipeline combining spatio-temporal autoregressive diffusion, lip-reading-based alignment, and ID-embedding guidance. Extensive experiments demonstrate state-of-the-art performance in multi-task generalization, motion diversity, identity fidelity, and viewpoint consistency.

πŸ“ Abstract
This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip-reading-based supervision, and finally to novel-view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches across different benchmarks.
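The decoupled learning idea from the abstract, where view consistency and temporal coherence are trained independently while sharing one set of model weights, can be illustrated with a toy sketch. Everything here is hypothetical (the paper's actual backbone is a video diffusion model, not a linear regressor; `SharedBackbone` and `decoupled_step` are invented names): the point is only that two objectives drawn from different data sources (multi-view frames vs. video clips) can alternately update the same parameters, so no joint 4D data is required.

```python
# Toy sketch of decoupled training: spatial (multi-view) and temporal (video)
# objectives take turns updating one shared set of weights.
# All names are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

class SharedBackbone:
    """Stand-in for the shared video diffusion backbone (here: a linear map)."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def loss_and_grad(self, x, target):
        pred = x @ self.w
        err = pred - target
        return float(np.mean(err ** 2)), 2 * x.T @ err / len(err)

def decoupled_step(model, spatial_batch, temporal_batch, lr=0.05):
    # Pass 1: frames of the same instant from different views -> view consistency.
    xs, ys = spatial_batch
    _, g = model.loss_and_grad(xs, ys)
    model.w -= lr * g
    # Pass 2: consecutive frames of a single view -> temporal coherence.
    xt, yt = temporal_batch
    _, g = model.loss_and_grad(xt, yt)
    model.w -= lr * g

model = SharedBackbone(dim=4)
true_w = np.array([1.0, -2.0, 0.5, 3.0])  # toy target the two objectives agree on
for _ in range(400):
    xs = rng.normal(size=(8, 4))
    xt = rng.normal(size=(8, 4))
    decoupled_step(model, (xs, xs @ true_w), (xt, xt @ true_w))

print(np.allclose(model.w, true_w, atol=1e-2))
```

The design choice this mimics is that neither objective ever needs a batch containing both multiple views *and* multiple timesteps of the same subject, which is exactly the scarce 4D data the paper avoids collecting.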
Problem

Research questions and friction points this paper is trying to address.

Generating identity-aware talking portraits from speech without strict reference constraints
Synthesizing free-viewpoint talking portraits without relying on explicit 3D reconstruction
Overcoming the limited motion diversity and identity drift of existing animation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identity-aware spatio-temporal video diffusion model
Decoupled learning for view consistency and temporal coherence
Self-forcing training scheme for longer temporal contexts
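The self-forcing scheme above can be sketched in miniature. The idea, per the abstract, is that during training the model is unrolled autoregressively over more chunks than it produces in one inference call, and each chunk is conditioned on the model's own previous outputs rather than ground truth, so long-horizon drift toward static motion becomes visible to the training loss. All names (`generate_chunk`, `rollout`, the chunk counts) are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative sketch of self-forcing: the training-time unroll is strictly
# longer than the inference-time unroll, and every chunk is conditioned on
# the model's OWN prior outputs. Names and dynamics are hypothetical.
def generate_chunk(context, step):
    """Stand-in for one denoising pass that emits the next motion chunk."""
    momentum = context[-1] if context else 0.0
    return 0.9 * momentum + 0.1 * step  # toy autoregressive dynamics

def rollout(num_chunks):
    context = []
    for t in range(num_chunks):
        # Self-forcing: the model's own output is fed back as conditioning.
        context.append(generate_chunk(context, t))
    return context

INFERENCE_CHUNKS = 4   # short window produced per inference call
TRAIN_CHUNKS = 16      # longer unroll seen by the training loss

train_traj = rollout(TRAIN_CHUNKS)
infer_traj = rollout(INFERENCE_CHUNKS)

print(train_traj[:INFERENCE_CHUNKS] == infer_traj)  # True: shared prefix
print(len(train_traj) > len(infer_traj))            # True: longer training context
```

Because the training unroll shares the inference prefix and then keeps going, a loss applied over `train_traj` penalizes behaviour (e.g. motion collapsing toward a fixed pose) that a loss over the short inference window alone would never observe.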