🤖 AI Summary
This work addresses the problem of predicting semantically plausible and functionally consistent 3D scene motion from a single input image. To this end, we introduce MoMap, a pixel-aligned motion representation, and construct the first large-scale dataset of MoMaps extracted from videos. We propose a two-stage paradigm for 2D video synthesis: MoMap generation followed by motion-guided warping and rendering. Our method combines diffusion-based motion generation, point-based rendering with occlusion-aware completion, and the transfer of motion priors from pretrained image generation models. Extensive experiments demonstrate that our approach achieves high-fidelity 2D video synthesis with geometrically consistent 3D motion trajectories across diverse, complex scenes; it significantly improves the semantic plausibility and cross-scene generalization of predicted motion, outperforming prior single-image motion prediction methods. This work establishes a new foundation for embodied visual understanding and generation from a single image alone.
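To make the pixel-aligned MoMap representation concrete, the sketch below stores, for every pixel of a reference image, a 3D displacement trajectory over T future timesteps; the shapes, variable names, and the reshaping into a multi-channel image are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a pixel-aligned motion map (MoMap): for each pixel of
# an H x W reference image, store a 3D trajectory over T future timesteps.
# Shapes and field names are illustrative assumptions, not the paper's code.
import numpy as np

H, W, T = 64, 64, 8          # image resolution and number of future steps

# momap[y, x, t] = (dx, dy, dz): 3D scene-space displacement of the point
# observed at pixel (y, x) after t timesteps, relative to its initial position.
momap = np.zeros((H, W, T, 3), dtype=np.float32)

# Example: mark a toy "object" region as translating along +x over time.
ys, xs = np.mgrid[20:40, 10:30]
for t in range(T):
    momap[ys, xs, t, 0] = 0.05 * (t + 1)   # motion accumulates along x

# Because the map is pixel-aligned, it can be treated like a multi-channel
# image (H x W x (T*3)), which is what makes image generation models a
# natural fit for producing it.
momap_as_image = momap.reshape(H, W, T * 3)
print(momap_as_image.shape)   # (64, 64, 24)
```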
📝 Abstract
This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, so as to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated with existing image generation models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.
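As a rough, self-contained illustration of the "generate a MoMap, then warp and complete" pipeline, the toy sketch below substitutes stand-in components (a hand-built MoMap, constant depth, orthographic warping, and hole filling from the input frame) for the paper's diffusion-based MoMap generator, point-based rendering, and learned completion model; it shows only the data flow, not the actual method.

```python
# Minimal, fully hypothetical sketch of the two-stage pipeline: (1) obtain a
# pixel-aligned MoMap, (2) warp the input image along the generated 3D motion
# and complete the holes. All components here are toy stand-ins.
import numpy as np

H, W, T = 48, 48, 4
rng = np.random.default_rng(0)

image = rng.random((H, W, 3)).astype(np.float32)      # input frame
depth = np.ones((H, W), dtype=np.float32)             # stand-in depth map
momap = np.zeros((H, W, T, 3), dtype=np.float32)      # stand-in "generated" MoMap
momap[10:30, 10:30, :, 0] = np.linspace(1, 4, T)      # region drifts right (pixels)

ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
frames = []
for t in range(T):
    # Warp: move each pixel's 3D point along its trajectory, then project.
    # With constant depth and orthographic projection this reduces to a 2D shift.
    x_new = np.clip(np.round(xs + momap[..., t, 0]), 0, W - 1).astype(int)
    y_new = np.clip(np.round(ys + momap[..., t, 1]), 0, H - 1).astype(int)

    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    warped[y_new, x_new] = image            # forward splat (last write wins)
    mask[y_new, x_new] = True

    # "Completion": fill disocclusion holes from the input image; the paper
    # instead completes the warped point-based renderings with a learned model.
    completed = np.where(mask[..., None], warped, image)
    frames.append(completed)

print(len(frames), frames[0].shape)   # 4 (48, 48, 3)
```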