VidMP3: Video Editing by Representing Motion with Pose and Position Priors

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based video editing methods suffer from temporal inconsistency, subject identity drift, and heavy reliance on manual intervention, particularly in structure-varying tasks, which prevents them from simultaneously preserving motion fidelity and allowing flexible semantic or structural edits. This paper proposes a motion-preserving video editing framework that explicitly models pose and spatial position priors to construct a general, disentangled motion representation, enabling the first diffusion-based, fully automatic structure-varying editing without user guidance. The approach integrates pose estimation, optical flow-guided feature transfer, and position-aware attention to extract and propagate spatiotemporal motion characteristics from the source video. Extensive evaluations across multiple benchmarks demonstrate significant improvements over state-of-the-art methods, both qualitatively (e.g., temporal coherence) and quantitatively (e.g., lower FVD and LPIPS scores). The code will be made publicly available.
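The summary mentions position-aware attention as one ingredient. A minimal toy sketch of the general idea, in which spatial position priors bias attention logits toward spatially nearby tokens, might look as follows. All names, the Gaussian distance bias, and the single-head formulation are illustrative assumptions, not the paper's actual formulation:

```python
import math

def position_aware_attention(queries, keys, values, positions_q, positions_k, sigma=1.0):
    """Toy single-head attention whose logits are biased by spatial proximity.

    positions_q / positions_k hold (x, y) coordinates per token; key tokens
    spatially close to the query receive a larger attention bias, loosely
    mimicking a position prior. Purely illustrative, not the paper's model.
    """
    out = []
    d = len(queries[0])  # feature dimension, for scaled dot-product
    for q, pq in zip(queries, positions_q):
        logits = []
        for k, pk in zip(keys, positions_k):
            dot = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            dist2 = (pq[0] - pk[0]) ** 2 + (pq[1] - pk[1]) ** 2
            # Subtract a Gaussian-style penalty so far-away keys get less weight.
            logits.append(dot - dist2 / (2.0 * sigma ** 2))
        # Numerically stable softmax over the biased logits.
        m = max(logits)
        w = [math.exp(x - m) for x in logits]
        s = sum(w)
        w = [x / s for x in w]
        # Attention-weighted sum of value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```

With two identical keys at different positions, the spatially nearer key dominates the output, which is the intended effect of the positional bias.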

📝 Abstract
Motion-preserved video editing is crucial for creators, particularly in scenarios that demand flexibility in both the structure and semantics of swapped objects. Despite its potential, this area remains underexplored. Existing diffusion-based editing methods excel in structure-preserving tasks, using dense guidance signals to ensure content integrity. While some recent methods attempt to address structure-variable editing, they often suffer from issues such as temporal inconsistency, subject identity drift, and the need for human intervention. To address these challenges, we introduce VidMP3, a novel approach that leverages pose and position priors to learn a generalized motion representation from source videos. Our method enables the generation of new videos that maintain the original motion while allowing for structural and semantic flexibility. Both qualitative and quantitative evaluations demonstrate the superiority of our approach over existing methods. The code will be made publicly available at https://github.com/sandeep-sm/VidMP3.
Problem

Research questions and friction points this paper is trying to address.

Addresses motion inconsistency in structure-variable video editing
Solves subject identity drift during semantic object swapping
Eliminates the need for human intervention in motion-preserved video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pose and position priors for motion representation
Learns generalized motion from source video data
Maintains original motion with structural flexibility
Sandeep Mishra
University of Texas at Austin
Oindrila Saha
UMass Amherst
Machine Learning · Deep Learning · Computer Vision
Alan C. Bovik
University of Texas at Austin