SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction

📅 2025-03-23

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

Text-driven 3D human motion editing suffers from misalignment between semantic editing instructions and generated motions. To address this, we propose a multi-task diffusion-Transformer hybrid framework. Our key contributions are: (1) introducing motion similarity prediction as an auxiliary task to jointly optimize editing accuracy and semantic consistency; (2) adopting a decoupled architecture that separately models motion editing and similarity discrimination, thereby enhancing interpretability and controllability; and (3) integrating motion embedding contrastive learning to improve cross-modal alignment between text and motion. Evaluated on the MotionFix benchmark, our method achieves state-of-the-art performance, yielding significant improvements in semantic alignment (+12.7%) and motion fidelity (+9.4%). This work establishes a new paradigm for controllable, semantics-aware 3D motion generation.

Technology Category

Application Category

📝 Abstract

Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often leading to misalignment between motion semantics and language instructions. In this paper, we introduce a related task, motion similarity prediction, and propose a multi-task training paradigm, where we train the model jointly on motion editing and motion similarity prediction to foster the learning of semantically meaningful representations. To complement this task, we design an advanced Diffusion-Transformer-based architecture that separately handles motion similarity prediction and motion editing. Extensive experiments demonstrate the state-of-the-art performance of our approach in both editing alignment and fidelity.

Problem

Research questions and friction points this paper is trying to address.

Text-based 3D human motion editing alignment challenge

Misalignment between motion semantics and language instructions

Need for semantically meaningful motion representation learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task training with motion similarity prediction

Diffusion-Transformer-based architecture for separate tasks

State-of-the-art performance in alignment and fidelity

🔎 Similar Papers

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion