Proteus-ID: ID-Consistent and Motion-Coherent Video Customization

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video identity customization must simultaneously ensure identity consistency, text-driven alignment of appearance and motion, and natural motion generation. To address this, the paper proposes a diffusion-based framework conditioned on a single reference image and a textual prompt. The method introduces a Multimodal Identity Fusion module that unifies visual and textual cues into a joint identity representation via a Q-Former; a Time-Aware Identity Injection mechanism that modulates identity conditioning across denoising steps to maintain cross-frame identity stability; and an Adaptive Motion Learning strategy whose optical-flow-guided motion-heatmap reweighting loss improves motion realism. Evaluated on the newly constructed Proteus-Bench, the approach outperforms prior methods along three key dimensions—identity preservation, text-video alignment, and motion quality—establishing new state-of-the-art performance.
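As a rough, hedged illustration of the Q-Former-style fusion mentioned above, the sketch below lets a small set of learnable identity queries cross-attend to concatenated image and text tokens. The dimensions, the single attention block, and the plain `nn.MultiheadAttention` layer are assumptions for illustration, not the paper's actual MIF design.

```python
import torch
import torch.nn as nn

class IdentityFusionQFormer(nn.Module):
    """Fuse image and text features into joint identity tokens (illustrative sketch only)."""

    def __init__(self, dim=768, num_queries=16, num_heads=8):
        super().__init__()
        # Learnable identity queries, shared across samples (assumed design choice)
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        """
        image_tokens: (B, Ni, dim) reference-image features
        text_tokens:  (B, Nt, dim) prompt features
        returns:      (B, num_queries, dim) fused identity tokens
        """
        # Joint visual + textual context so neither modality dominates the queries
        context = torch.cat([image_tokens, text_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, context, context)
        q = self.norm1(q + attn_out)
        return self.norm2(q + self.ffn(q))
```

The fused tokens would then condition the diffusion backbone; how many queries or layers the actual module uses is not specified in this summary.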

📝 Abstract
Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.
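The optical-flow-based loss reweighting in AML can be pictured with a short PyTorch-style sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the heatmap is taken to be per-sample-normalized flow magnitude, the flow is assumed to already match the resolution of the noise prediction, and the mixing weight `alpha` is a hypothetical hyperparameter.

```python
import torch
import torch.nn.functional as F

def motion_weighted_loss(pred_noise, target_noise, flow, alpha=1.0):
    """Reweight the diffusion denoising loss with an optical-flow motion heatmap.

    pred_noise, target_noise: (B, C, T, H, W) predicted / target noise
    flow: (B, 2, T, H, W) optical flow, assumed resized to the prediction resolution
    alpha: strength of the motion reweighting (assumed hyperparameter)
    """
    # Per-pixel motion magnitude, normalized to [0, 1] per sample
    magnitude = flow.norm(dim=1, keepdim=True)                              # (B, 1, T, H, W)
    heatmap = magnitude / (magnitude.amax(dim=(2, 3, 4), keepdim=True) + 1e-6)

    # High-motion regions receive larger weight; static regions keep weight 1
    weight = 1.0 + alpha * heatmap

    per_pixel = F.mse_loss(pred_noise, target_noise, reduction="none")
    return (weight * per_pixel).mean()
```

Because the heatmap is derived from the training clip itself, this kind of reweighting is self-supervised and needs no extra inputs at inference time, which matches the abstract's description of AML.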
Problem

Research questions and friction points this paper is trying to address.

Maintaining identity consistency in video customization
Generating natural motion without unrealistic stiffness
Unifying visual and textual cues for coherent guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Identity Fusion unifies visual and textual cues
Time-Aware Identity Injection modulates identity conditioning dynamically across denoising steps (see the sketch after this list)
Adaptive Motion Learning enhances motion realism through self-supervised loss reweighting
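One way to picture the time-aware modulation in TAII is a gate on the injected identity features that depends on the denoising timestep. The sketch below is an assumption-laden illustration: the sigmoid gating network, the additive injection, and all shapes are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TimeAwareIdentityInjection(nn.Module):
    """Scale identity features by a timestep-dependent gate (illustrative sketch only)."""

    def __init__(self, id_dim, time_dim):
        super().__init__()
        # Maps the diffusion timestep embedding to a per-channel gate in (0, 1)
        self.gate = nn.Sequential(nn.Linear(time_dim, id_dim), nn.Sigmoid())

    def forward(self, hidden, id_embed, t_embed):
        """
        hidden:   (B, N, id_dim) video tokens inside the diffusion backbone
        id_embed: (B, M, id_dim) fused identity tokens (e.g. from the MIF module)
        t_embed:  (B, time_dim) timestep embedding
        """
        g = self.gate(t_embed).unsqueeze(1)                   # (B, 1, id_dim)
        # Inject identity information with a strength that varies over denoising steps
        return hidden + (g * id_embed).mean(dim=1, keepdim=True)
```

In practice the injection is more likely realized through attention layers inside the backbone; the additive form here is only meant to convey the step-dependent gating idea.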
Guiyu Zhang
The Chinese University of Hong Kong (Shenzhen)
Computer Vision · Pattern Recognition · Machine Learning
Chen Shi
The Chinese University of Hong Kong, Shenzhen
Zijian Jiang
The Chinese University of Hong Kong, Shenzhen
Xunzhi Xiang
Nanjing University
Jingjing Qian
The Chinese University of Hong Kong, Shenzhen
Shaoshuai Shi
Didi Chuxing, Max Planck Institute for Informatics
Computer Vision · Deep Learning · Autonomous Driving
Li Jiang
The Chinese University of Hong Kong, Shenzhen