SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

📅 2025-10-14
🤖 AI Summary
Current text-driven motion generation methods struggle to capture motion semantics and model scene interactions simultaneously, primarily because no large-scale dataset offers both rich text-motion alignment and precise scene geometry annotations. To address this, we propose a two-stage scene-aware adaptation framework: (1) a motion-inbetweening proxy task bridges disjoint text-motion and scene-motion data; (2) in the latent space, a learnable keyframe modulation layer and a cross-attention scene-conditioning layer enable adaptive fusion of semantics and geometry. Our method is the first to co-model these disjoint sources without requiring joint annotations. It significantly improves the scene consistency of generated motions and sheds light on the mechanisms through which scene awareness is injected.

📝 Abstract
Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text-motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene-motion and text-motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness into text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
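The keyframing idea in the abstract can be illustrated with a small sketch. This is not the paper's actual layer; it is a hypothetical FiLM-style modulation in which hypothetical learned parameters `scale` and `shift` are applied only at keyframe positions, so non-keyframe latents pass through unchanged and the latent manifold is preserved:

```python
# Illustrative sketch only (not the released SceneAdapt code): a keyframing
# layer that modulates motion latents at keyframe positions and leaves
# in-between frames untouched. `scale` and `shift` stand in for learned
# per-feature parameters.

def keyframe_modulate(latents, keyframe_mask, scale, shift):
    """latents:       list of T feature vectors (lists of floats)
    keyframe_mask: list of T ints (1 = keyframe, 0 = in-between frame)
    scale, shift:  per-feature modulation parameters (lists of floats)
    """
    out = []
    for frame, is_key in zip(latents, keyframe_mask):
        if is_key:
            # affine modulation of keyframe latents
            out.append([x * (1.0 + s) + b for x, s, b in zip(frame, scale, shift)])
        else:
            # in-between latents pass through unchanged, preserving the manifold
            out.append(list(frame))
    return out
```

During inbetweening training, only the keyframe positions would carry conditioning signal, which is what lets the proxy task be learned without any text labels.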
Problem

Research questions and friction points this paper is trying to address.

Injecting scene awareness into text-conditioned human motion generation models
Bridging disjoint scene-motion and text-motion datasets through adaptation stages
Overcoming limitations of isolated motion semantics or scene-awareness approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages disjoint datasets through two adaptation stages
Modulates motion latents with keyframing layers for inbetweening
Injects scene geometry via cross-attention conditioning layer
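The third bullet describes a standard cross-attention pattern: motion latents act as queries over scene geometry tokens. A minimal single-head sketch, assuming hypothetical projection matrices `W_q`, `W_k`, `W_v` and a residual connection (the paper's actual layer may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scene_cross_attention(motion_latents, scene_tokens, W_q, W_k, W_v):
    """Sketch of cross-attention scene conditioning: each motion frame
    adaptively queries local scene context; the residual keeps the output
    close to the original motion-latent manifold."""
    Q = motion_latents @ W_q                   # (T, d) queries from motion
    K = scene_tokens @ W_k                     # (S, d) keys from scene geometry
    V = scene_tokens @ W_v                     # (S, d) values from scene geometry
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T, S) scaled similarities
    attn = softmax(scores, axis=-1)            # per-frame weights over scene tokens
    return motion_latents + attn @ V           # residual injection of scene context
```

Because the scene enters only through keys and values, the same text-to-motion backbone can run with or without scene input, which matches the adaptation framing above.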
👥 Authors
Jungbin Cho
Yonsei University
Minsu Kim
Yonsei University
Jisoo Kim
Yonsei University
Ce Zheng
Carnegie Mellon University
Laszlo A. Jeni
Assistant Professor in the Robotics Institute, Carnegie Mellon University
Computational Behavior, Computer Vision, Deep Learning
Ming-Hsuan Yang
University of California at Merced; Google DeepMind
Computer Vision, Machine Learning, Artificial Intelligence
Youngjae Yu
Seoul National University
Seonjoo Kim
Yonsei University