Lynx: Towards High-Fidelity Personalized Video Generation

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenging task of generating high-fidelity, identity-consistent, and temporally coherent personalized videos from a single reference image. We propose an identity-aware video generation framework based on the Diffusion Transformer. Its core innovations are two lightweight adapters: (1) the ID-Adapter, which compresses ArcFace facial embeddings into compact identity tokens for explicit identity modeling; and (2) the Ref-Adapter, which injects frozen VAE dense features via a reference path and preserves fine-grained spatiotemporal details through cross-layer cross-attention. The method integrates a Perceiver Resampler with VAE feature fusion to enable efficient identity-guided diffusion-based video synthesis. Evaluated on 800 test samples (40 identities × 20 prompts), our approach achieves state-of-the-art performance, significantly improving face similarity (+12.3%), prompt adherence (+9.7%), and video quality (FVD reduced by 28.5%).

Technology Category

Application Category

📝 Abstract
We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross-attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. Through evaluation on a curated benchmark of 40 subjects and 20 unbiased prompts, which yielded 800 test cases, Lynx has demonstrated superior face resemblance, competitive prompt following, and strong video quality, thereby advancing the state of personalized video generation.
Problem

Research questions and friction points this paper is trying to address.

Personalized video generation from single image input
Ensuring high identity fidelity in synthesized videos
Maintaining temporal coherence and visual realism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight adapters for identity fidelity
Perceiver Resampler converts facial embeddings
Cross-attention integrates dense VAE features
🔎 Similar Papers
No similar papers found.
S
Shen Sang
Intelligent Creation, ByteDance
T
Tiancheng Zhi
Intelligent Creation, ByteDance
Tianpei Gu
Tianpei Gu
Research Scientist, ByteDance/TikTok
Computer VisionGenerative Model
J
Jing Liu
Intelligent Creation, ByteDance
Linjie Luo
Linjie Luo
Research Manager at ByteDance AI Lab
Computer GraphicsComputer Vision