VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models, trained solely with pixel-level reconstruction objectives, often compromise motion coherence and physical plausibility. Method: a framework that learns a joint appearance-motion representation in a single shared latent space, trained with a dual objective combining pixel reconstruction with explicit motion prediction. At inference, an Inner-Guidance mechanism steers generation using the model's own evolving motion prediction as a dynamic guidance signal, without requiring auxiliary data or architectural modifications. Contribution/Results: the method is a lightweight, plug-and-play addition applicable to existing video generation models with minimal adaptation. Experiments demonstrate state-of-the-art motion coherence, with significant improvements in physical plausibility and perceived visual quality, while preserving high-fidelity appearance reconstruction.
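The dual-objective training described above can be sketched as a combined loss computed from one shared prediction head. This is a minimal illustration, not the paper's exact formulation: the function name, MSE form, and `motion_weight` parameter are assumptions.

```python
def videojam_style_loss(pred_pixels, target_pixels,
                        pred_motion, target_motion,
                        motion_weight=1.0):
    """Illustrative dual objective: pixel reconstruction plus motion
    prediction from a single shared representation. The weighting and
    squared-error form are assumptions, not the paper's definition."""
    def mse(a, b):
        # Mean squared error over flat sequences of values.
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    pixel_loss = mse(pred_pixels, target_pixels)   # appearance term
    motion_loss = mse(pred_motion, target_motion)  # motion term
    return pixel_loss + motion_weight * motion_loss
```

With perfect pixel reconstruction and a unit motion error, the loss reduces to the weighted motion term alone.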

📝 Abstract
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
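Inner-Guidance, as the abstract describes it, reuses the model's own motion prediction as a guidance signal during sampling. The sketch below is a classifier-free-guidance-style illustration of that idea, not the paper's exact update rule; the `denoise` signature and the weights `w_text` and `w_motion` are hypothetical.

```python
def inner_guidance_step(denoise, x_t, t, w_text=7.5, w_motion=2.0):
    """Illustrative guided denoising step. `denoise` is a hypothetical
    model call whose conditioning inputs can be dropped; the specific
    combination of terms below is an assumption."""
    # Three forward passes: full conditioning, no conditioning,
    # and conditioning with the motion signal dropped.
    eps_full = denoise(x_t, t, use_text=True, use_motion=True)
    eps_none = denoise(x_t, t, use_text=False, use_motion=False)
    eps_no_motion = denoise(x_t, t, use_text=True, use_motion=False)
    # Extrapolate toward both the text condition and the model's own
    # motion prediction, in the style of classifier-free guidance.
    return (eps_none
            + w_text * (eps_full - eps_none)
            + w_motion * (eps_full - eps_no_motion))
```

Because the motion signal comes from the model's own prediction at each step, no external motion estimator is needed at inference, matching the abstract's claim of requiring no auxiliary data.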
Problem

Research questions and friction points this paper is trying to address.

Improves motion coherence in video generation
Integrates appearance and motion representations
Enhances visual quality without data modification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint appearance-motion representation
Inner-Guidance for coherent motion
Applicable to any video model