🤖 AI Summary
Video personalization faces two key challenges: high computational cost due to reliance on large-scale video fine-tuning, and difficulty in maintaining cross-frame fine-grained appearance consistency. This paper proposes a diffusion-based personalization method that requires no video fine-tuning—only a few reference images suffice for high-fidelity identity preservation. Our core contribution is a training-free coarse-to-fine appearance adaptation framework: (1) semantic alignment via LoRA-augmented image encoding and adaptive subject embedding; and (2) precise identity injection and cross-frame representation correction through RoPE-free mid-layer attention feature matching and mask-controllable value warping—both achieved without architectural modification. Experiments demonstrate that our method achieves personalized generation quality on par with full video fine-tuning, while significantly improving appearance consistency and text-motion alignment, all at substantially reduced computational overhead.
📝 Abstract
Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based fine-tuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This stage encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) An inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query–key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and achieves these gains efficiently without large-scale video fine-tuning.
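The fine appearance injection stage can be pictured as correspondence matching followed by masked value warping. The sketch below is a simplified, single-head illustration under our own assumptions; the names (`gen_q`, `ref_k`, `ref_v`, `tau`) are illustrative and not the paper's actual API, and the real method operates inside a video diffusion transformer's attention layers rather than on standalone arrays.

```python
# Minimal sketch of mask-controlled value warping: match each generated
# token to its most similar reference token via (RoPE-free) query-key
# similarity, then copy over the reference's appearance-rich value feature
# wherever the match is confident enough. All names are hypothetical.
import numpy as np

def warp_values(gen_q, ref_k, ref_v, tau=0.5):
    """Warp reference value features into the generation stream.

    gen_q: (N, d) query features of the generated-frame tokens
    ref_k: (M, d) key features of the reference-image tokens
    ref_v: (M, d) appearance-rich value features of the reference
    tau:   similarity threshold controlling the reliability mask
    """
    # Cosine similarity between generation queries and reference keys
    qn = gen_q / np.linalg.norm(gen_q, axis=-1, keepdims=True)
    kn = ref_k / np.linalg.norm(ref_k, axis=-1, keepdims=True)
    sim = qn @ kn.T                      # (N, M) semantic correspondence
    match = sim.argmax(axis=-1)          # best reference token per gen token
    conf = sim.max(axis=-1)              # confidence of each match
    mask = conf > tau                    # keep only reliable correspondences
    warped = ref_v[match]                # (N, d) warped value features
    return warped, mask
```

In the full method, `warped` would replace (or blend into) the value representations only at positions where `mask` is true, so unreliable correspondences never corrupt the generated frames.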