LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation of identity fidelity across both spatial and temporal dimensions in text-to-video generation, this paper proposes LaVieID, a local autoregressive diffusion transformer (DiT) framework for high-fidelity, identity-consistent personalized video synthesis. Methodologically, the authors introduce a local routing mechanism that enhances latent-space representations of critical regions (e.g., faces), and integrate a temporal autoregressive module that captures long-range inter-frame dependencies through temporal chunking, weighted local structural fusion, and bias correction. Extensive experiments demonstrate that LaVieID achieves state-of-the-art performance in identity consistency and video quality metrics (FVD, FID), significantly outperforming existing DiT-based baselines. The source code and pretrained models are publicly released.

📝 Abstract
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.
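The local router described above re-expresses each facial latent state as a weighted combination over a bank of fine-grained local structures. A minimal NumPy sketch of that routing pattern is shown below; the sizes, the random structure bank, and the dot-product-plus-softmax affinity are all illustrative assumptions, not the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens, dim = 16, 8        # facial latent tokens and channel width (toy sizes)
num_structures = 4             # size of the local facial structure bank (assumed)

latents = rng.normal(size=(num_tokens, dim))          # DiT latent states for the face region
structures = rng.normal(size=(num_structures, dim))   # hypothetical local structure bank

# Router: softmax affinity between each latent token and each local structure.
logits = latents @ structures.T                       # (num_tokens, num_structures)
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # rows sum to 1

# Each latent is explicitly represented as a weighted combination of local structures,
# rather than modeled as a global, unstructured state.
routed = weights @ structures                         # (num_tokens, dim)

assert routed.shape == latents.shape
```

Restricting each token to a convex combination of a small structure bank is what localizes the representation: interference between unrelated facial features is confined to the few structures a token actually attends to.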
Problem

Research questions and friction points this paper is trying to address.

Addresses identity loss in text-to-video generation
Mitigates facial feature interference via local modeling
Enhances inter-frame identity consistency temporally
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local router for facial structure representation
Temporal autoregressive module for consistency
Weighted combinations of fine-grained features
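The temporal autoregressive module can be pictured as follows: denoised latent tokens are split into temporal chunks, and each chunk is rectified by a bias predicted from the chunks before it. The sketch below uses a trivial stand-in predictor (a running mean) in place of the paper's learned module; chunk size and tensor shapes are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

frames, dim = 12, 8
chunk_size = 4                            # frames per temporal chunk (assumed)
tokens = rng.normal(size=(frames, dim))   # denoised latent tokens, one row per frame

def predict_bias(history, dim):
    """Stand-in for the learned bias predictor: here just the mean of all
    previously rectified frames (the real module exploits long-range
    temporal dependencies with a learned network)."""
    if history.size == 0:
        return np.zeros(dim)
    return history.mean(axis=0)

rectified_chunks = []
for start in range(0, frames, chunk_size):
    chunk = tokens[start:start + chunk_size]
    history = (np.concatenate(rectified_chunks)
               if rectified_chunks else np.empty((0, dim)))
    bias = predict_bias(history, dim)        # conditioned on past chunks only
    rectified_chunks.append(chunk + bias)    # rectify tokens before video decoding

rectified = np.concatenate(rectified_chunks)
assert rectified.shape == tokens.shape
```

The autoregressive structure is the point: because each chunk's correction is conditioned only on earlier chunks, the rectification can enforce identity consistency across frames without breaking causal ordering at decode time.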