UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak scene awareness and the difficulty of modeling continuous motion in language-driven human-object interaction (HOI) generation for complex indoor environments, this paper introduces UniHM, a unified multimodal motion-language model and the first framework to support both text-to-motion and text-to-HOI generation in complex 3D scenes. Methodologically, the authors propose a hybrid motion representation that combines continuous 6-DoF global poses with discrete local motion tokens, and integrate a look-up-free quantization variational autoencoder (LFQ-VAE) to mitigate the information loss of motion tokenization. They further construct an enriched version of the Lingo dataset augmented with HumanML3D annotations, enabling joint alignment of natural language, 3D scene geometry, motion sequences, and spatial waypoints. Experiments demonstrate competitive performance on the OMOMO benchmark for text-to-HOI generation and on HumanML3D for text-conditioned motion generation, with clear gains in motion plausibility and scene consistency.
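The summary does not spell out how look-up-free quantization works. As a rough sketch: LFQ-style quantizers from prior work (e.g., MAGVIT-v2) binarize each latent dimension by its sign, so the implicit codebook has 2^dim entries and no embedding table is consulted. The PyTorch snippet below illustrates that idea; the class name, shapes, and straight-through trick are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LookUpFreeQuantizer(nn.Module):
    """Per-dimension binary quantizer: the implicit codebook has 2**dim
    entries, so no embedding-table lookup is needed (hence "lookup-free")."""

    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
        # Powers of two map each sign pattern to a unique integer token id.
        self.register_buffer("basis", 2 ** torch.arange(dim))

    def forward(self, z: torch.Tensor):
        # z: (..., dim) continuous encoder output.
        q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
        # Straight-through estimator: quantize forward, identity backward.
        q = z + (q - z).detach()
        # Token id = integer encoding of the sign pattern.
        indices = ((z > 0).long() * self.basis).sum(dim=-1)
        return q, indices


# Example: 10 latent dims -> implicit codebook of 2**10 = 1024 tokens.
quantizer = LookUpFreeQuantizer(dim=10)
z = torch.randn(4, 16, 10)       # (batch, frames, dim)
q, token_ids = quantizer(z)      # q: same shape as z; token_ids: (4, 16)
```

Because the token id is just the integer encoding of the sign pattern, the codebook grows exponentially with latent width without any nearest-neighbor search, which is the argument commonly cited for LFQ's better reconstruction and generation behavior than classic VQ codebooks.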

📝 Abstract
Human motion synthesis in complex scenes presents a fundamental challenge, extending beyond conventional Text-to-Motion tasks by requiring the integration of diverse modalities such as static environments, movable objects, natural language prompts, and spatial waypoints. Existing language-conditioned motion models often struggle with scene-aware motion generation due to limitations in motion tokenization that cause information loss and fail to capture the continuous, context-dependent nature of 3D human movement. To address these issues, we propose UniHM, a unified motion-language model that leverages diffusion-based generation for synthesizing scene-aware human motion. UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes. Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6-DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE (LFQ-VAE) that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of the Lingo dataset augmented with HumanML3D annotations, providing stronger supervision for scene-specific motion learning. Experimental results demonstrate that UniHM achieves competitive performance on the OMOMO benchmark for text-to-HOI synthesis and on HumanML3D for general text-conditioned motion generation.
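To make the mixed-motion representation of contribution (1) concrete, here is a minimal illustrative container pairing a continuous 6-DoF global root trajectory with discrete per-frame local tokens. Every name and shape below is an assumption for illustration, not UniHM's actual data layout.

```python
from dataclasses import dataclass

import torch


@dataclass
class HybridMotion:
    """A possible container for the mixed representation: continuous 6-DoF
    global root motion plus discrete local-motion token ids per frame."""

    root_translation: torch.Tensor  # (T, 3) global position per frame
    root_rotation: torch.Tensor     # (T, 3) global orientation (axis-angle)
    local_tokens: torch.Tensor      # (T,) discrete ids from a motion VAE

    def to_features(self, token_embed: torch.nn.Embedding) -> torch.Tensor:
        """Fuse both streams into one (T, 6 + embed_dim) feature sequence."""
        continuous = torch.cat([self.root_translation, self.root_rotation], dim=-1)
        discrete = token_embed(self.local_tokens)  # (T, embed_dim)
        return torch.cat([continuous, discrete], dim=-1)


# Example with T = 120 frames and a 64-dim token embedding.
embed = torch.nn.Embedding(1024, 64)
motion = HybridMotion(
    root_translation=torch.zeros(120, 3),
    root_rotation=torch.zeros(120, 3),
    local_tokens=torch.randint(0, 1024, (120,)),
)
features = motion.to_features(embed)  # shape: (120, 70)
```

The design intuition the abstract points to is that the global 6-DoF trajectory stays continuous (and thus scene-alignable), while only the local body motion is tokenized, limiting the information lost to quantization.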
Problem

Research questions and friction points this paper is trying to address.

Generating human motion in complex indoor scenes with objects
Overcoming information loss in motion tokenization and weak scene-context modeling
Unifying Text-to-Motion and Human-Object Interaction synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based generation for scene-aware motion synthesis (see the sampling sketch after this list)
Mixed-motion representation combining 6DoF and local tokens
Look-Up-Free Quantization VAE for improved motion accuracy
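For the diffusion-based generation listed above, a minimal DDPM-style ancestral sampling loop is sketched below. It assumes a hypothetical `denoiser(x, t, cond)` network that predicts the added noise given text/scene/waypoint conditioning; UniHM's actual sampler and noise schedule may differ.

```python
import torch


@torch.no_grad()
def sample_motion(denoiser, cond, num_frames, feat_dim, steps=50):
    """DDPM-style ancestral sampling. `denoiser(x, t, cond)` is assumed to
    predict the noise in x at step t, conditioned on text/scene embeddings."""
    betas = torch.linspace(1e-4, 0.02, steps)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, num_frames, feat_dim)    # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])         # posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject noise
    return x
```

The returned sequence would then be split back into its continuous 6-DoF channels and decoded token features, per the mixed representation sketched earlier.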
Authors
Zichen Geng
Zeeshan Hayder (Australian National University, Data61/CSIRO; Computer Vision, Machine Learning, AI)
Wei Liu
Ajmal Mian (Senior Member, IEEE)