LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation of identity fidelity across both spatial and temporal dimensions in text-to-video generation, this paper proposes LaVieID, a local autoregressive diffusion transformer (DiT) framework for high-fidelity, identity-consistent personalized video synthesis. Methodologically, the authors introduce a local routing mechanism that enhances latent-space representations of critical regions (e.g., faces), and integrate a temporal autoregressive module that captures long-range inter-frame dependencies through temporal chunking, weighted local structural fusion, and bias correction. Extensive experiments demonstrate that LaVieID achieves state-of-the-art performance in identity consistency and video quality metrics (FVD, FID), significantly outperforming existing DiT-based baselines. The source code and pretrained models are publicly released.

📝 Abstract
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.
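The local router described above re-expresses each facial latent state as a weighted combination over a bank of fine-grained local structures. A minimal NumPy sketch of that routing pattern is shown below; the sizes, the random structure bank, and the dot-product-plus-softmax affinity are all illustrative assumptions, not the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens, dim = 16, 8        # facial latent tokens and channel width (toy sizes)
num_structures = 4             # size of the local facial structure bank (assumed)

latents = rng.normal(size=(num_tokens, dim))          # DiT latent states for the face region
structures = rng.normal(size=(num_structures, dim))   # hypothetical local structure bank

# Router: softmax affinity between each latent token and each local structure.
logits = latents @ structures.T                       # (num_tokens, num_structures)
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # rows sum to 1

# Each latent is explicitly represented as a weighted combination of local structures,
# rather than modeled as a global, unstructured state.
routed = weights @ structures                         # (num_tokens, dim)

assert routed.shape == latents.shape
```

Restricting each token to a convex combination of a small structure bank is what localizes the representation: interference between unrelated facial features is confined to the few structures a token actually attends to.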
Problem

Research questions and friction points this paper is trying to address.

Addresses identity loss in text-to-video generation
Mitigates facial feature interference via local modeling
Enhances inter-frame identity consistency temporally
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local router for facial structure representation
Temporal autoregressive module for consistency
Weighted combinations of fine-grained features
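The temporal autoregressive module can be pictured as follows: denoised latent tokens are split into temporal chunks, and each chunk is rectified by a bias predicted from the chunks before it. The sketch below uses a trivial stand-in predictor (a running mean) in place of the paper's learned module; chunk size and tensor shapes are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

frames, dim = 12, 8
chunk_size = 4                            # frames per temporal chunk (assumed)
tokens = rng.normal(size=(frames, dim))   # denoised latent tokens, one row per frame

def predict_bias(history, dim):
    """Stand-in for the learned bias predictor: here just the mean of all
    previously rectified frames (the real module exploits long-range
    temporal dependencies with a learned network)."""
    if history.size == 0:
        return np.zeros(dim)
    return history.mean(axis=0)

rectified_chunks = []
for start in range(0, frames, chunk_size):
    chunk = tokens[start:start + chunk_size]
    history = (np.concatenate(rectified_chunks)
               if rectified_chunks else np.empty((0, dim)))
    bias = predict_bias(history, dim)        # conditioned on past chunks only
    rectified_chunks.append(chunk + bias)    # rectify tokens before video decoding

rectified = np.concatenate(rectified_chunks)
assert rectified.shape == tokens.shape
```

The autoregressive structure is the point: because each chunk's correction is conditioned only on earlier chunks, the rectification can enforce identity consistency across frames without breaking causal ordering at decode time.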