🤖 AI Summary
Existing subject-driven video generation methods lack fine-grained temporal control over subject appearance and disappearance, hindering applications such as compositing, storyboarding, and controllable animation. To address this, we propose the first explicit timestamp-conditioned modeling framework: it encodes subject-associated temporal intervals, without introducing extra attention modules, to enable lightweight, high-precision multi-subject temporal orchestration. Built on a pre-trained video diffusion model, our method combines temporal interval encoding, subject-specific text token concatenation, and token-wise feature fusion to ensure cross-frame identity consistency. Quantitative and qualitative evaluations demonstrate state-of-the-art performance in multi-subject identity preservation, video fidelity, and temporal accuracy. Notably, our approach is the first to support precise, timestamp-level controllable generation with multiple reference subjects.
📝 Abstract
Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach uses a novel positional encoding mechanism that encodes temporal intervals, associated in our case with subject identities, while integrating seamlessly with the pretrained video generation model's positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT matches the visual quality of state-of-the-art video personalization methods while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page: https://snap-research.github.io/Video-AlcheMinT
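The abstract describes two mechanisms: encoding a subject's temporal interval with a positional encoding, and attaching that condition to the subject's tokens by token-wise concatenation rather than extra cross-attention. The sketch below is a loose illustration of that idea only, not AlcheMinT's actual implementation: the sinusoidal form of the encoding and all function names (`sinusoidal_embedding`, `interval_token`, `condition_subject_tokens`) are assumptions for demonstration.

```python
import numpy as np

def sinusoidal_embedding(t, dim=16):
    """Illustrative sinusoidal embedding of a scalar timestamp t in [0, 1].
    (Assumed form; the paper's actual positional encoding may differ.)"""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # shape (dim,)

def interval_token(t_start, t_end, dim=16):
    """Encode a subject's visibility interval [t_start, t_end] as one vector
    by concatenating the embeddings of its two endpoints."""
    return np.concatenate([sinusoidal_embedding(t_start, dim),
                           sinusoidal_embedding(t_end, dim)])  # shape (2*dim,)

def condition_subject_tokens(subject_tokens, t_start, t_end, code_dim=16):
    """Token-wise concatenation: append the same interval code to every token
    of a subject, so no additional cross-attention module is required."""
    code = interval_token(t_start, t_end, dim=code_dim // 2)   # shape (code_dim,)
    tiled = np.tile(code, (subject_tokens.shape[0], 1))        # one copy per token
    return np.concatenate([subject_tokens, tiled], axis=-1)

# Hypothetical usage: 4 tokens of a subject visible in [0.2, 0.8] of the clip.
tokens = np.random.rand(4, 32)
conditioned = condition_subject_tokens(tokens, 0.2, 0.8, code_dim=16)
```

Because the interval code only widens the token feature dimension, the only new parameters a model would need are in the input projection that consumes the wider tokens, which is consistent with the abstract's claim of negligible parameter overhead.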