SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SceMoS, a framework for text-driven 3D human motion generation that jointly achieves global path planning and local contact realism without relying on computationally expensive full 3D scene representations. SceMoS demonstrates, for the first time, that structured 2D scene representations, specifically bird's-eye-view maps and local height maps, can effectively replace full 3D supervision. By leveraging DINOv2-encoded bird's-eye-view maps for text-conditioned autoregressive global trajectory planning, together with a geometry-grounded motion tokenizer based on a conditional VQ-VAE that embeds physical constraints via 2D height maps, the method reduces scene-encoder parameters by over 50% while preserving motion realism and contact accuracy. Evaluated on the TRUMANS benchmark, SceMoS achieves state-of-the-art performance in both motion fidelity and contact precision, striking a favorable balance between efficiency and realism.
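
To make the two-stage design concrete, the sketch below shows how a pipeline like the one summarized above could be wired together: a frozen DINOv2 backbone encodes the bird's-eye-view rendering, and a small text-conditioned autoregressive planner rolls out a global trajectory. All module names, dimensions, and the GRU-based planner are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a BEV-conditioned global trajectory planner (assumed design).
import torch
import torch.nn as nn

class BEVTrajectoryPlanner(nn.Module):
    """Text-conditioned autoregressive planner over a global BEV feature (sketch)."""
    def __init__(self, text_dim=512, bev_dim=384, d_model=256, horizon=60):
        super().__init__()
        self.horizon = horizon
        self.cond_proj = nn.Linear(text_dim + bev_dim, d_model)
        self.step_proj = nn.Linear(3, d_model)              # previous (x, y, yaw) waypoint
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 3)                   # predicts the next waypoint

    def forward(self, text_emb, bev_emb):
        B = text_emb.size(0)
        # Fuse text and BEV features into the initial hidden state.
        h = self.cond_proj(torch.cat([text_emb, bev_emb], dim=-1)).unsqueeze(0).contiguous()
        wp = torch.zeros(B, 1, 3, device=text_emb.device)   # start pose at the origin
        waypoints = []
        for _ in range(self.horizon):                       # autoregressive rollout
            out, h = self.rnn(self.step_proj(wp), h)
            wp = self.head(out)
            waypoints.append(wp)
        return torch.cat(waypoints, dim=1)                  # (B, horizon, 3) global path

# Frozen off-the-shelf DINOv2 backbone encodes the bird's-eye-view rendering.
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
bev_image = torch.rand(1, 3, 224, 224)                      # placeholder BEV rendering
with torch.no_grad():
    bev_emb = dinov2(bev_image)                             # (1, 384) global scene feature

text_emb = torch.rand(1, 512)                               # placeholder text embedding (e.g. CLIP)
planner = BEVTrajectoryPlanner()
trajectory = planner(text_emb, bev_emb)                     # waypoints for the motion stage
```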

📝 Abstract
Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework showing that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues: (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image of the scene, rendered from an elevated corner and encoded with DINOv2 features, and (2) a geometry-grounded motion tokenizer, trained as a conditional VQ-VAE on 2D local scene height maps, that embeds surface physics directly into a discrete vocabulary. This 2D factorization achieves a favorable efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local height maps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark while reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.
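
The abstract's second component, the geometry-grounded tokenizer, can be pictured as a conditional VQ-VAE whose encoder and decoder both see a local height-map patch alongside the pose, so that the discrete codes carry surface geometry. The sketch below is a minimal illustration under assumed pose and height-map dimensions; it is not the paper's architecture.

```python
# Minimal sketch of a conditional VQ-VAE motion tokenizer (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalMotionVQVAE(nn.Module):
    """Motion tokenizer conditioned on local 2D height maps (sketch)."""
    def __init__(self, pose_dim=66, hmap_size=16, hmap_dim=64,
                 codebook_size=512, code_dim=128):
        super().__init__()
        self.hmap_enc = nn.Sequential(                        # embed the local terrain patch
            nn.Flatten(), nn.Linear(hmap_size * hmap_size, hmap_dim), nn.ReLU())
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim + hmap_dim, 256), nn.ReLU(), nn.Linear(256, code_dim))
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.decoder = nn.Sequential(
            nn.Linear(code_dim + hmap_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))

    def quantize(self, z):
        # Nearest codebook entry -> discrete, geometry-aware motion token.
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx)
        # Standard VQ-VAE losses: codebook update term + encoder commitment term.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                          # straight-through estimator
        return z_q, idx, vq_loss

    def forward(self, poses, height_maps):
        # poses: (B, T, pose_dim); height_maps: (B, T, H, W) local patches under the body.
        B, T, _ = poses.shape
        h = self.hmap_enc(height_maps.flatten(0, 1))          # (B*T, hmap_dim)
        z = self.encoder(torch.cat([poses.flatten(0, 1), h], dim=-1))
        z_q, idx, vq_loss = self.quantize(z)
        recon = self.decoder(torch.cat([z_q, h], dim=-1)).view(B, T, -1)
        return recon, idx.view(B, T), vq_loss

# Toy usage with made-up shapes: 2 clips of 30 frames, 66-D poses, 16x16 height maps.
model = ConditionalMotionVQVAE()
poses = torch.rand(2, 30, 66)
hmaps = torch.rand(2, 30, 16, 16)
recon, tokens, vq_loss = model(poses, hmaps)
loss = F.mse_loss(recon, poses) + vq_loss                     # reconstruction + VQ terms
```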
Problem

Research questions and friction points this paper is trying to address.

3D human motion synthesis
scene-aware motion
physical feasibility
text-driven animation
human-scene interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

scene-aware motion synthesis
2D scene representation
geometry-grounded tokenization
motion planning
conditional VQ-VAE