Scaling Zero-Shot Reference-to-Video Generation

πŸ“… 2025-12-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing reference-to-video (R2V) generation methods heavily rely on costly, scarce explicit image-video-text triplet annotations, severely limiting scalability and practical deployment. This paper introduces Saberβ€”the first zero-shot R2V framework that requires no triplet supervision, trained exclusively on large-scale video-text pairs. Its core innovations are: (1) a mask-aware attention mechanism that disentangles identity preservation from motion modeling, effectively eliminating copy-paste artifacts; and (2) a reference-aware masking augmentation strategy enabling identity-consistent video generation from single or multiple reference images. Evaluated on the OpenS2V-Eval benchmark, Saber significantly outperforms triplet-supervised methods, achieving state-of-the-art performance in both generation quality and generalization across varying numbers of reference images.
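The mask-aware attention idea described above can be sketched as an attention-mask pattern: reference tokens stay isolated from one another (so each identity is preserved independently), while video tokens may attend everywhere. The token layout, the exact connectivity, and the function below are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def build_mask_aware_attention(n_video, ref_lengths):
    """Boolean attention mask (True = may attend) for a sequence laid out as
    [video tokens | ref_1 tokens | ref_2 tokens | ...].

    Hypothetical pattern: video tokens attend to the full sequence; each
    reference's tokens attend only to themselves and the video tokens,
    never to other references, keeping identities disentangled.
    """
    n = n_video + sum(ref_lengths)
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_video, :] = True                 # video tokens see everything
    start = n_video
    for length in ref_lengths:
        end = start + length
        mask[start:end, start:end] = True    # within-reference attention
        mask[start:end, :n_video] = True     # reference -> video attention
        start = end
    return mask

# 4 video tokens plus two references of 2 and 3 tokens.
mask = build_mask_aware_attention(n_video=4, ref_lengths=[2, 3])
```

In a transformer this mask would be applied additively (disallowed entries set to -inf) before the softmax; blocking cross-reference attention is one plausible way to stop one subject's features from leaking into another's.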

πŸ“ Abstract
Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
Problem

Research questions and friction points this paper is trying to address.

Eliminates need for expensive reference image-video-text triplets
Generates identity-consistent videos from reference images without explicit R2V data
Mitigates copy-paste artifacts in reference-to-video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot framework bypasses expensive triplet data
Masked training strategy learns identity-consistent representations
Mask augmentation reduces copy-paste artifacts in generation
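The masking-augmentation idea in the bullets above can be sketched as follows: since Saber trains only on video-text pairs, a pseudo-reference can be cropped from a training frame and perturbed so the model cannot trivially copy pixels back into the output. The box-based crop and the specific perturbations (flip, brightness jitter) are illustrative assumptions; the paper's actual augmentation recipe may differ.

```python
import numpy as np

def make_pseudo_reference(frame, box, rng):
    """Crop a subject region from a training frame and perturb it so it can
    serve as a pseudo-reference image during video-text-only training.

    frame: H x W x 3 float array in [0, 255]
    box:   (y0, y1, x0, x1) crop bounds (hypothetical subject region)
    """
    y0, y1, x0, x1 = box
    crop = frame[y0:y1, x0:x1].astype(np.float32)
    if rng.random() < 0.5:                   # random horizontal flip
        crop = crop[:, ::-1]
    crop = crop * rng.uniform(0.8, 1.2)      # brightness jitter
    return np.clip(crop, 0.0, 255.0)

rng = np.random.default_rng(0)
frame = rng.uniform(0, 255, size=(64, 64, 3))
ref = make_pseudo_reference(frame, box=(8, 40, 8, 40), rng=rng)
```

Because the pseudo-reference no longer matches the target frame pixel-for-pixel, the model is pushed to encode the subject's identity rather than memorize a patch, which is the mechanism the bullets credit for reducing copy-paste artifacts.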