RepVideo: Rethinking Cross-Layer Representation for Video Generation

📅 2025-01-15

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

In video generation, unstable intermediate-layer features cause temporal incoherence across frames and spatial distortion. This work identifies inter-layer attention map discrepancy as a key source of temporal inconsistency—a finding not previously reported. We propose Cross-Layer Feature Accumulation (CLFA), a parameter-free mechanism that aggregates semantic features across multiple layers to enhance consistency in attention inputs, thereby jointly improving spatial fidelity and temporal coherence. CLFA is integrated into text-to-video diffusion frameworks by redesigning the input representation for attention modules. Experiments demonstrate state-of-the-art performance: improvements of 12.3% in FVD and 9.7% in FID, alongside superior scores on TCL. The method significantly enhances modeling accuracy of multi-object spatial relationships and structural similarity between adjacent frames.

Technology Category

Application Category

📝 Abstract

Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.

Problem

Research questions and friction points this paper is trying to address.

Video Generation

Information Variation

Visual Coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

RepVideo

Multi-level Information Integration

Visual Coherence Enhancement

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding

2024-02-20International Conference on Machine LearningCitations: 30

Nvidia

The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

US, CA, Remote / US, WA, Remote / US, OR, Remote

AI Research Scientist, Computer Vision - Facebook Video Intelligence