BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation

📅 2025-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current zero-shot customized text-to-video (T2V) generation methods suffer from poor generalization, and naïve integration of text-to-image (T2I) models with motion modules introduces structural and texture distortions. To address these limitations, this paper proposes a test-time autoregressive structure-texture propagation and reward optimization framework. The key contributions are: (1) a test-time autoregressive Structure-Texture Propagation Module (STPM), which enforces identity and geometric consistency across frames via cross-frame feature propagation guided by CLIP and DINO features; and (2) Test-Time Reward Optimization (TTRO), which refines fine-grained texture generation through reward-based calibration. Crucially, the method preserves T2I priors while achieving high-fidelity, temporally coherent customized video synthesis. Experiments demonstrate significant gains (+7.8 in CLIP-I similarity and +13.1 in DINO consistency over baselines), alongside superior qualitative and quantitative performance on standard benchmarks.

📝 Abstract
Both zero-shot and tuning-based customized text-to-image (CT2I) generation have made significant progress for storytelling content creation. In contrast, research on customized text-to-video (CT2V) generation remains relatively limited. Existing zero-shot CT2V methods suffer from poor generalization, while another line of work directly combining tuning-based T2I models with temporal motion modules often leads to the loss of structural and texture information. To bridge this gap, we propose an autoregressive structure and texture propagation module (STPM), which extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. Additionally, we introduce a test-time reward optimization (TTRO) method to further refine fine-grained details. Quantitative and qualitative experiments validate the effectiveness of STPM and TTRO, demonstrating improvements of 7.8 and 13.1 in CLIP-I and DINO consistency metrics over the baseline, respectively.
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between customized image and video generation
Improving structural and texture consistency in video frames
Enhancing fine-grained details through test-time optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive structure and texture propagation module
Test-time reward optimization for refinement
Enhances consistency with key feature injection
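The two contributions above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes frames and the reference subject are already encoded as feature vectors (e.g., CLIP/DINO embeddings), uses a simple blending coefficient `alpha` for the autoregressive propagation, and uses cosine similarity to the reference as a stand-in reward for the test-time refinement step. All function names and parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def cosine(a, b):
    """Cosine similarity, used here as a stand-in identity reward."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def propagate(ref_feat, frame_feats, alpha=0.5):
    """Autoregressive propagation sketch: each frame's features are
    blended with an identity-carrying state seeded from the reference
    subject, so structure/texture cues flow frame to frame."""
    state = ref_feat
    out = []
    for f in frame_feats:
        state = alpha * state + (1 - alpha) * f  # carry identity forward
        out.append(state.copy())
    return out

def reward_refine(ref_feat, feats, steps=50, lr=0.1):
    """Test-time refinement sketch: nudge each frame's features toward
    a higher reward (similarity to the reference) without retraining."""
    refined = []
    for f in feats:
        x = f.copy()
        for _ in range(steps):
            x = x + lr * (ref_feat - x)  # move toward the reference feature
        refined.append(x)
    return refined
```

The propagation step increases each frame's similarity to the reference relative to its raw features, and the refinement loop pushes it further; the real method operates on diffusion features with learned guidance rather than this toy blending.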
👥 Authors
Panwen Hu, Mohamed bin Zayed University of Artificial Intelligence
Jiehui Huang, Sun Yat-sen University
Qiang Sun, Mohamed bin Zayed University of Artificial Intelligence; University of Toronto
Xiaodan Liang, Professor of Computer Science, Sun Yat-sen University; MBZUAI; CMU; NUS