RefAlign: Representation Alignment for Reference-to-Video Generation

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of copy-paste artifacts and multi-subject confusion in reference-to-video (R2V) generation, which arise from modality misalignment. To this end, the authors propose RefAlign, a framework that incorporates an explicit representation alignment mechanism within a diffusion Transformer architecture. During training, a reference alignment loss aligns features from the reference branch with the semantic space of a vision foundation model, pulling together features of the same subject while pushing apart those of different subjects. This enhances both identity consistency and semantic discriminability without introducing additional inference overhead, effectively balancing textual controllability and reference fidelity. Evaluated on the OpenS2V-Eval benchmark, RefAlign significantly outperforms existing methods, achieving state-of-the-art performance in both overall generation quality and identity consistency.

📝 Abstract
Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
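The abstract describes the reference alignment loss only at a high level: same-subject reference and VFM features are pulled together, different-subject features are pushed apart. Below is a minimal PyTorch sketch of one plausible instantiation as an InfoNCE-style contrastive objective. The function name, the assumption that features are already pooled into one vector per subject, and the temperature value are illustrative guesses, not the paper's exact formulation (which may operate at token level or use a learned projection head).

import torch
import torch.nn.functional as F

def reference_alignment_loss(ref_feats, vfm_feats, subject_ids, temperature=0.07):
    """Contrastive alignment between DiT reference-branch features and
    frozen VFM features (a sketch, not the paper's exact loss).

    ref_feats:   (N, D) reference-branch features, one row per subject crop
    vfm_feats:   (N, D) matching vision-foundation-model features
    subject_ids: (N,)   integer subject labels; equal ids form positive pairs
    """
    ref = F.normalize(ref_feats, dim=-1)
    vfm = F.normalize(vfm_feats, dim=-1)

    # Pairwise cosine similarities between reference and VFM features.
    logits = ref @ vfm.t() / temperature  # (N, N)

    # Positive pairs: entries whose subject labels match.
    pos_mask = subject_ids.unsqueeze(0) == subject_ids.unsqueeze(1)  # (N, N)

    # Log-softmax over all candidates pushes non-matching subjects apart,
    # while averaging over positives pulls matching pairs together.
    log_prob = F.log_softmax(logits, dim=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

Consistent with the abstract, such a term would be added to the diffusion training objective only; nothing changes at inference time, so sampling cost is unaffected.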
Problem

Research questions and friction points this paper is trying to address.

reference-to-video generation
modality mismatch
copy-paste artifacts
multi-subject confusion
identity consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

representation alignment
reference-to-video generation
visual foundation model
diffusion Transformer
identity consistency
👥 Authors
Lei Wang - Associate Professor at Lancaster University (Electromagnetics Theory, Microwave, Antennas, Wireless Propagation)
YuXin Song - Baidu Inc.
Ge Wu - PCA Lab, VCIP, College of Computer Science, Nankai University
Haocheng Feng - Baidu (Computer Vision)
Hang Zhou - Baidu Inc. (Computer Vision, Audio Processing, Multimodal Learning)
Jingdong Wang - Baidu Inc.
Yaxing Wang - Associate Professor, Nankai University (Deep Learning, GANs, Image-to-image Translation, Transfer Learning)
Jian Yang - PCA Lab, VCIP, College of Computer Science, Nankai University; PCA Lab, School of Intelligence Science and Technology, Nanjing University