BachVid: Training-Free Video Generation with Consistent Background and Character

📅 2025-10-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Generating multiple videos with consistent characters and backgrounds remains challenging: existing approaches rely either on reference images or extensive training, and often ensure only character consistency. This paper introduces BachVid, the first training-free, reference-free framework for consistent video generation. Its core innovation lies in analyzing the attention mechanisms and intermediate features of Diffusion Transformers (DiTs) to automatically extract foreground masks, identify correspondence points across videos, and cache and reuse identity-relevant intermediate variables to inject consistency into new generations. BachVid jointly models inter-video consistency for both foreground subjects and background scenes. Experiments demonstrate that, without any training, BachVid significantly improves visual consistency across multiple generated videos, achieving stable, efficient performance while eliminating reliance on external supervision signals or hand-designed modules.

📝 Abstract
Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a significant challenge. Existing methods typically rely on reference images or extensive training, and often only address character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching the intermediate variables, and then injecting these cached variables into corresponding positions in newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.
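The cache-and-inject pipeline the abstract describes can be sketched in miniature. This is an illustrative toy, not the paper's code: the layer count, feature shapes, and the `denoise_pass` / `attention_features` names are assumptions, and matched positions are given directly rather than recovered from attention as in the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, NUM_TOKENS, DIM = 2, 8, 4

def attention_features(layer, latent):
    """Stand-in for a DiT block's intermediate output (random projection here)."""
    w = rng.standard_normal((DIM, DIM))
    return latent @ w

cache = {}  # layer index -> features cached from the identity video

def denoise_pass(latent, inject=False, matches=None):
    """One simplified pass over all layers; caches or injects features."""
    for layer in range(NUM_LAYERS):
        feats = attention_features(layer, latent)
        if inject and layer in cache:
            # Overwrite features at matched token positions with the cached
            # identity features, propagating the identity's appearance.
            feats[matches] = cache[layer][matches]
        else:
            cache[layer] = feats.copy()
        latent = feats
    return latent

# Pass 1: generate the identity video and fill the cache.
identity_latent = rng.standard_normal((NUM_TOKENS, DIM))
denoise_pass(identity_latent)

# Pass 2: a new video reuses cached features at matched positions.
new_latent = rng.standard_normal((NUM_TOKENS, DIM))
matches = np.array([0, 1, 2])  # token indices matched across the two videos
out = denoise_pass(new_latent, inject=True, matches=matches)
```

After the second pass, the matched tokens carry exactly the cached identity features, while unmatched tokens evolve freely, which is the consistency/diversity split the method aims for.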
Problem

Research questions and friction points this paper is trying to address.

Achieving consistent character and background across multiple generated videos
Eliminating dependency on reference images or extensive training
Leveraging attention mechanisms for foreground-background consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free video generation without reference images
Leveraging DiT attention for foreground masks and matching
Injecting cached variables to ensure foreground-background consistency
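The first two innovations above hinge on reading structure out of attention maps. A minimal sketch of the idea, under assumptions: thresholding cross-attention from video tokens to the character's text token yields a binary foreground mask. The token index, threshold rule, and all names here are illustrative, not the paper's actual procedure.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
NUM_VIDEO_TOKENS, NUM_TEXT_TOKENS, DIM = 16, 4, 8
SUBJECT_TOKEN = 2  # assumed index of the character's word in the prompt

q = rng.standard_normal((NUM_VIDEO_TOKENS, DIM))  # video-token queries
k = rng.standard_normal((NUM_TEXT_TOKENS, DIM))   # text-token keys

# Cross-attention map: each row is one video token's distribution over text tokens.
attn = softmax(q @ k.T / np.sqrt(DIM))
subject_attn = attn[:, SUBJECT_TOKEN]

# Video tokens attending above-average to the subject word count as foreground.
mask = subject_attn > subject_attn.mean()
```

In practice such maps would come from the DiT's own attention layers during denoising; the point of the sketch is only that a foreground/background split falls out of per-token attention scores with no extra training.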
Han Yan
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Xibin Song
Vertex Lab
Yifu Wang
Tencent XR Vision Labs
Computer Vision, Robotics, Event-based Vision, SLAM, Visual Odometry
Hongdong Li
Australian National University
Pan Ji
Ph.D., ex Tencent XR Vision Labs
Computer Vision, Machine Learning, 3D Vision, Graphics
Chao Ma
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University