ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Maintaining character identity consistency (e.g., hairstyle, clothing, and body shape) across diverse scenes remains challenging in text-to-video (T2V) generation. To address this, we propose ContextAnyone, the first context-aware T2V framework for identity-consistent generation. Built upon the DiT architecture, it introduces three key components: (1) an Emphasize-Attention module to strengthen reference-image feature integration; (2) Gap-RoPE positional encoding to explicitly decouple reference-image tokens from video tokens; and (3) a dual-guided loss to suppress identity drift. Leveraging only a single reference image, ContextAnyone achieves fine-grained contextual fidelity across multiple actions and scenes. Extensive experiments demonstrate that our method significantly outperforms existing personalized T2V approaches in both identity consistency and visual quality, enabling robust character reconstruction under diverse motion patterns.
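The Gap-RoPE idea, as described above, decouples reference-image tokens from video tokens by inserting a positional offset between the two groups before applying rotary position encoding. A minimal sketch of that indexing scheme is below; the function name and the `gap` value are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def gap_rope_positions(num_ref_tokens: int, num_video_tokens: int,
                       gap: int = 256) -> torch.Tensor:
    """Assign 1-D positional indices so reference tokens occupy
    [0, num_ref_tokens) while video tokens start after a fixed gap,
    keeping the two token groups separated in the RoPE position space.
    (Hypothetical sketch; the paper may use a different offset scheme.)"""
    ref_pos = torch.arange(num_ref_tokens)
    video_pos = torch.arange(num_video_tokens) + num_ref_tokens + gap
    return torch.cat([ref_pos, video_pos])
```

With the gap in place, rotary encodings for reference and video tokens no longer interleave, which is the decoupling effect the summary attributes to Gap-RoPE.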

📝 Abstract
Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose **ContextAnyone**, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: https://github.com/ziyang1106/ContextAnyone
Problem

Research questions and friction points this paper is trying to address.

Maintaining consistent character identities across video scenes
Preserving broader contextual cues like hairstyle and outfit
Integrating reference image information to prevent identity drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a context-aware diffusion framework with a reference image.
Integrates an Emphasize-Attention module to reinforce reference features.
Applies dual-guidance loss and Gap-RoPE embedding for consistency.
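The dual-guidance loss listed above combines the standard diffusion denoising objective with a reference-reconstruction objective, since the model jointly reconstructs the reference image while generating video frames. A minimal sketch of such a weighted combination is shown below; the function name and the weighting factor `lam` are hypothetical, and the paper's exact terms and weights may differ:

```python
import torch
import torch.nn.functional as F

def dual_guidance_loss(pred_video_noise: torch.Tensor,
                       true_video_noise: torch.Tensor,
                       pred_ref: torch.Tensor,
                       ref_image: torch.Tensor,
                       lam: float = 0.5) -> torch.Tensor:
    """Sum of the diffusion denoising loss on video tokens and a
    reconstruction loss on the reference image, weighted by `lam`.
    (Illustrative combination; not the paper's exact formulation.)"""
    diffusion_loss = F.mse_loss(pred_video_noise, true_video_noise)
    recon_loss = F.mse_loss(pred_ref, ref_image)
    return diffusion_loss + lam * recon_loss
```

Tying the reconstruction term to the same backbone forward pass is what lets the reference signal supervise identity preservation during generation.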