RefAlign: Representation Alignment for Reference-to-Video Generation

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of copy-paste artifacts and multi-subject confusion in reference-to-video (R2V) generation, which arise from modality misalignment. To this end, the authors propose RefAlign, a framework that incorporates an explicit representation alignment mechanism within a diffusion Transformer architecture. During training, a reference alignment loss aligns features from the reference branch with the semantic space of a vision foundation model, pulling together features of the same subject while pushing apart those of different subjects. This enhances both identity consistency and semantic discriminability without introducing additional inference overhead, effectively balancing textual controllability and reference fidelity. Evaluated on the OpenS2V-Eval benchmark, RefAlign significantly outperforms existing methods, achieving state-of-the-art performance in both overall generation quality and identity consistency.

📝 Abstract
Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
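The abstract describes the reference alignment loss only at a high level: same-subject reference and VFM features are pulled together, different-subject features are pushed apart. Below is a minimal PyTorch sketch of one plausible instantiation as an InfoNCE-style contrastive objective. The function name, the assumption that features are already pooled into one vector per subject, and the temperature value are illustrative guesses, not the paper's exact formulation (which may operate at token level or use a learned projection head).

import torch
import torch.nn.functional as F

def reference_alignment_loss(ref_feats, vfm_feats, subject_ids, temperature=0.07):
    """Contrastive alignment between DiT reference-branch features and
    frozen VFM features (a sketch, not the paper's exact loss).

    ref_feats:   (N, D) reference-branch features, one row per subject crop
    vfm_feats:   (N, D) matching vision-foundation-model features
    subject_ids: (N,)   integer subject labels; equal ids form positive pairs
    """
    ref = F.normalize(ref_feats, dim=-1)
    vfm = F.normalize(vfm_feats, dim=-1)

    # Pairwise cosine similarities between reference and VFM features.
    logits = ref @ vfm.t() / temperature  # (N, N)

    # Positive pairs: entries whose subject labels match.
    pos_mask = subject_ids.unsqueeze(0) == subject_ids.unsqueeze(1)  # (N, N)

    # Log-softmax over all candidates pushes non-matching subjects apart,
    # while averaging over positives pulls matching pairs together.
    log_prob = F.log_softmax(logits, dim=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

Consistent with the abstract, such a term would be added to the diffusion training objective only; nothing changes at inference time, so sampling cost is unaffected.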
Problem

Research questions and friction points this paper is trying to address.

reference-to-video generation
modality mismatch
copy-paste artifacts
multi-subject confusion
identity consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

representation alignment
reference-to-video generation
visual foundation model
diffusion Transformer
identity consistency
👥 Authors
Lei Wang - Associate Professor at Lancaster University (Electromagnetics Theory, Microwave, Antennas, Wireless Propagation)
YuXin Song - Baidu Inc.
Ge Wu - PCA Lab, VCIP, College of Computer Science, Nankai University
Haocheng Feng - Baidu (Computer Vision)
Hang Zhou - Baidu Inc. (Computer Vision, Audio Processing, Multimodal Learning)
Jingdong Wang - Baidu Inc.
Yaxing Wang - Associate Professor, Nankai University (Deep Learning, GANs, Image-to-image Translation, Transfer Learning)
Jian Yang - PCA Lab, VCIP, College of Computer Science, Nankai University; PCA Lab, School of Intelligence Science and Technology, Nanjing University