Object-Aware 4D Human Motion Generation

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models frequently produce human motions exhibiting geometric distortions, semantic misalignment, and physically implausible behavior, largely because they lack 3D spatial and physical priors. To address this, we propose Motion Score Distilled Interaction (MSDI), the first framework to integrate the spatial-semantic reasoning of large language models (LLMs) with Motion Diffusion Score Distillation Sampling (MSDS), enabling zero-shot, out-of-distribution, object-aware 4D human motion generation without model fine-tuning. MSDI combines 3D Gaussian splatting representations, pre-trained motion diffusion models, and LLM-driven spatial prompting to explicitly model complex human-object interactions while preserving physical consistency. Extensive experiments demonstrate that MSDI outperforms state-of-the-art methods in motion naturalness, physical plausibility, and cross-scenario generalization.

📝 Abstract
Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. Given pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), draws on the spatial and semantic knowledge of large language models (LLMs) and on motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models to refine human motion while respecting object and semantic constraints. Unlike prior methods that require joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution, object-aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.
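The core optimization the abstract describes, distilling score gradients from a pre-trained motion diffusion model while enforcing object constraints, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `fake_denoiser` is a hypothetical stand-in for a real pre-trained motion diffusion model, the noise schedule is a toy cosine schedule, and the spherical penetration penalty stands in for the LLM-derived object placement and semantic constraints.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_denoiser(noisy_motion, t):
    """Hypothetical stand-in for a pre-trained motion diffusion model's
    noise prediction eps_theta(x_t, t); a real model would also condition
    on a text prompt."""
    return 0.9 * noisy_motion / np.sqrt(1.0 + t)

def sds_step(motion, lr=0.05, object_center=None, radius=0.3):
    """One spatial-aware score-distillation update (sketch of the MSDS idea).
    motion: (frames, joints, 3) array of joint positions."""
    t = rng.uniform(0.1, 0.9)                 # random diffusion timestep
    alpha = np.cos(0.5 * np.pi * t) ** 2      # toy noise schedule
    eps = rng.standard_normal(motion.shape)   # injected Gaussian noise
    noisy = np.sqrt(alpha) * motion + np.sqrt(1.0 - alpha) * eps
    eps_pred = fake_denoiser(noisy, t)
    # SDS-style gradient: weighted residual between predicted and true noise.
    sds_grad = (1.0 - alpha) * (eps_pred - eps)

    # Spatial constraint: push joints out of a spherical object proxy,
    # standing in for the object-aware constraints from the LLM prompting.
    pen_grad = np.zeros_like(motion)
    if object_center is not None:
        diff = motion - object_center                       # (F, J, 3)
        dist = np.linalg.norm(diff, axis=-1, keepdims=True) # (F, J, 1)
        inside = dist < radius
        # Negative gradient points outward, so descent resolves penetration.
        pen_grad = np.where(inside, -(radius - dist) * diff / (dist + 1e-8), 0.0)

    return motion - lr * (sds_grad + pen_grad)

motion = rng.standard_normal((8, 22, 3))   # 8 frames, 22 joints, xyz
updated = sds_step(motion, object_center=np.array([0.0, 0.0, 0.0]))
```

In the actual framework this update would be iterated many times, with the motion parameterized by the 3D Gaussian human representation rather than raw joint positions.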
Problem

Research questions and friction points this paper is trying to address.

Generating physically plausible 4D human motions
Addressing unrealistic deformations and semantic violations
Ensuring object-aware interactions without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-aware 4D motion generation using 3D Gaussian representations
Motion Score Distilled Interaction with spatial-aware optimization
Zero-shot approach leveraging motion diffusion and LLM semantics
🔎 Similar Papers
2024-09-05 · European Conference on Computer Vision · Citations: 4