Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

📅 2026-04-13
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Future video prediction in complex dynamic scenes struggles to simultaneously preserve visual fidelity and semantic consistency. To address this challenge, this work proposes Re2Pix, a novel hierarchical prediction framework that pioneers a “semantics-first” paradigm: it first autoregressively predicts future semantic representations within the frozen feature space of a vision foundation model and then leverages these predictions to guide a latent diffusion model in synthesizing high-fidelity video frames. To mitigate representation mismatch between training and inference, the framework incorporates nested dropout and a hybrid supervision strategy. Experiments demonstrate that Re2Pix significantly outperforms strong diffusion-based baselines on autonomous driving benchmarks, achieving consistent improvements in temporal semantic coherence, perceptual quality, and training efficiency.
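
The two-stage control flow described above can be illustrated with a minimal sketch. It is not the Re2Pix implementation: the names `rollout_features`, `render_frames`, `linear_extrapolate`, and `toy_decode` are hypothetical, features are plain lists of floats, and the predictor and decoder are toy stand-ins for the autoregressive feature model and the latent diffusion model.

```python
from typing import Callable, List, Sequence

Feature = List[float]


def rollout_features(
    context: Sequence[Feature],
    predict_next: Callable[[Sequence[Feature]], Feature],
    horizon: int,
) -> List[Feature]:
    """Stage 1: autoregressively extend a feature history by `horizon` steps."""
    history = list(context)
    future: List[Feature] = []
    for _ in range(horizon):
        nxt = predict_next(history)
        future.append(nxt)
        history.append(nxt)  # each prediction is fed back as input
    return future


def render_frames(
    features: Sequence[Feature],
    decode: Callable[[Feature], List[float]],
) -> List[List[float]]:
    """Stage 2: turn each predicted feature into a frame via a conditional decoder."""
    return [decode(f) for f in features]


# Toy stand-ins (assumptions for illustration only).
def linear_extrapolate(history: Sequence[Feature]) -> Feature:
    last, prev = history[-1], history[-2]
    return [2 * a - b for a, b in zip(last, prev)]


def toy_decode(feature: Feature) -> List[float]:
    return list(feature)  # placeholder for representation-guided synthesis


context = [[0.0, 1.0], [1.0, 1.5]]
future = rollout_features(context, linear_extrapolate, horizon=3)
frames = render_frames(future, toy_decode)
print(future)  # [[2.0, 2.0], [3.0, 2.5], [4.0, 3.0]]
```

Because every predicted feature re-enters the history, errors in stage 1 can compound over the horizon, which is why the decoder has to tolerate imperfect conditioning.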

📝 Abstract
Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix
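
As a rough illustration of the two conditioning strategies named in the abstract, the sketch below shows nested dropout in its generic form (keep a random-length prefix of channels, zero the rest) and one possible reading of mixed supervision (randomly conditioning on predicted rather than ground-truth features during training). The abstract does not specify how Re2Pix applies either, so `nested_dropout`, `pick_conditioning`, and `p_pred` are hypothetical names and assumptions, not the authors' code.

```python
import random
from typing import List


def nested_dropout(feature: List[float], rng: random.Random) -> List[float]:
    """Keep the first k channels (k uniform in 1..len) and zero out the rest."""
    k = rng.randint(1, len(feature))
    return feature[:k] + [0.0] * (len(feature) - k)


def pick_conditioning(
    gt: List[float],
    pred: List[float],
    p_pred: float,
    rng: random.Random,
) -> List[float]:
    """Assumed mixing rule: use predicted features with probability p_pred."""
    return pred if rng.random() < p_pred else gt


rng = random.Random(0)
gt_feature = [0.9, -0.2, 0.4, 0.1]    # ground-truth representation (toy values)
pred_feature = [0.8, -0.1, 0.5, 0.0]  # imperfect predicted representation

# Conditioning signal for one training step of the frame generator.
cond = nested_dropout(
    pick_conditioning(gt_feature, pred_feature, p_pred=0.5, rng=rng), rng
)
print(cond)
```

Both perturbations expose the generator to degraded conditioning at training time, which is the general idea behind closing the gap between ground-truth and predicted representations.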
Problem

Research questions and friction points this paper is trying to address.

video prediction
semantic consistency
visual fidelity
autonomous driving
scene dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-guided video prediction
hierarchical decomposition
latent diffusion model
representation forecasting
train-test mismatch mitigation