Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 3D human motion synthesis methods primarily focus on geometric constraints while lacking deep semantic understanding of the surrounding scene. To address this, we propose a semantics-aware motion synthesis framework. Our method introduces (1) a unified Scene Semantic Occupancy (SSO) representation that jointly encodes CLIP-derived semantic features and spatial occupancy via shared linear dimensionality reduction; and (2) a bidirectional tri-plane decomposition architecture with frame-level scene queries, enabling instruction-driven, fine-grained motion generation conditioned on scene semantics. Extensive experiments on cluttered scenes built from ShapeNet furniture and on scanned scenes from the PROX and Replica datasets demonstrate significant improvements over state-of-the-art approaches in semantic fidelity, computational efficiency, and cross-scene generalization. Ablation studies further validate the effectiveness of both SSO and the query mechanism in capturing scene-aware motion priors.

📝 Abstract
Human motion synthesis in 3D scenes relies heavily on scene comprehension, yet current methods focus mainly on scene structure and ignore semantic understanding. In this paper, we propose a human motion synthesis framework that takes a unified Scene Semantic Occupancy (SSO) as scene representation, termed SSOMotion. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and shared linear dimensionality reduction. This strategy preserves fine-grained scene semantic structure while significantly reducing redundant computation. We further use these scene hints, together with the movement direction derived from instructions, to control motion via frame-wise scene queries. Extensive experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate cutting-edge performance and validate the method's effectiveness and generalization ability. Code will be publicly available at https://github.com/jingyugong/SSOMotion.
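The abstract describes fusing per-voxel CLIP semantics with spatial occupancy through a shared linear dimensionality reduction. The minimal sketch below illustrates that idea only; the grid size, feature widths, and the masking-by-occupancy step are assumptions for illustration, not the paper's actual SSO construction.

```python
import numpy as np

# Hypothetical sizes -- the paper does not specify them.
CLIP_DIM = 512   # width of a CLIP embedding
SSO_DIM = 32     # target width of the unified SSO feature
GRID = 16        # voxels per axis of the occupancy grid

rng = np.random.default_rng(0)

# Stand-in per-voxel CLIP semantic features (in practice these would
# come from encoding object labels or images with a CLIP model).
clip_feats = rng.normal(size=(GRID, GRID, GRID, CLIP_DIM))

# Binary spatial occupancy of the scene.
occupancy = (rng.random(size=(GRID, GRID, GRID)) > 0.7).astype(np.float64)

# One shared linear projection reduces every voxel's semantics to SSO_DIM.
W = rng.normal(size=(CLIP_DIM, SSO_DIM)) / np.sqrt(CLIP_DIM)

# Unified Scene Semantic Occupancy: reduced semantics, zeroed where empty.
sso = (clip_feats @ W) * occupancy[..., None]
print(sso.shape)  # (16, 16, 16, 32)
```

Sharing one projection matrix across all voxels is what keeps the representation unified: every location lands in the same low-dimensional semantic space, so downstream motion queries can compare them directly.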
Problem

Research questions and friction points this paper is trying to address.

Synthesizing human motion in 3D scenes with semantic understanding
Creating unified scene representation using semantic occupancy features
Generating motion controlled by scene context and movement instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Scene Semantic Occupancy for scene representation
Bi-directional tri-plane decomposition for compact SSO
Frame-wise scene query with CLIP encoding for motion control
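A tri-plane decomposition replaces a dense 3D feature volume with three axis-aligned 2D planes, cutting storage from O(G^3) to O(3 G^2). The sketch below shows the generic factorization with mean pooling and additive recomposition; the paper's bi-directional variant and its learned projections are not reproduced here, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
G, D = 16, 32                            # hypothetical grid size / feature width
volume = rng.normal(size=(G, G, G, D))   # e.g. an SSO feature volume

# Collapse the volume onto three axis-aligned planes (mean over one axis).
plane_xy = volume.mean(axis=2)           # (G, G, D)
plane_xz = volume.mean(axis=1)           # (G, G, D)
plane_yz = volume.mean(axis=0)           # (G, G, D)

def query(x, y, z):
    """Recompose an approximate per-voxel feature from the three planes."""
    return plane_xy[x, y] + plane_xz[x, z] + plane_yz[y, z]

feat = query(3, 5, 7)
print(feat.shape)  # (32,)
```

A frame-wise scene query then amounts to looking up the feature at (or near) each body position per motion frame, which is cheap against three 2D planes.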
Jingyu Gong
Shanghai Jiao Tong University
3D Computer Vision
Kunkun Tong
School of Computer Science and Technology, East China Normal University, Shanghai, China
Zhuoran Chen
New York University Shanghai
Robotics, Computer Vision
Chuanhan Yuan
College of Computer Science of Chongqing University, Chongqing, China
Mingang Chen
Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai, China
Zhizhong Zhang
Associate Researcher, East China Normal University
Computer Vision
Xin Tan
School of Computer Science and Technology, East China Normal University, Shanghai, China
Yuan Xie
School of Computer Science and Technology, East China Normal University, Shanghai, China