🤖 AI Summary
Current text-driven motion generation methods struggle to capture motion semantics and model scene interactions at the same time, primarily because there are no large-scale datasets jointly annotated with both rich text–motion alignments and precise scene geometry. To address this, SceneAdapt is a two-stage scene-aware adaptation framework: (1) a motion-inbetweening proxy task bridges disjoint text–motion and scene–motion data; (2) in the latent space, learnable keyframing layers and a cross-attention scene-conditioning layer enable adaptive fusion of semantics and geometry. The method jointly models these multi-source signals without requiring joint annotations, significantly improves the scene consistency of generated motions, and sheds light on the mechanisms through which scene awareness emerges.
📝 Abstract
Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene awareness in isolation, since constructing large-scale datasets with both rich text–motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene–motion and text–motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task that bridges the two distinct datasets and thereby injects scene awareness into text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
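The abstract's second stage, where motion latents adaptively query local scene context through cross-attention, can be illustrated with a minimal NumPy sketch. All names, shapes, and the single-head formulation here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scene_cross_attention(motion_latents, scene_feats, Wq, Wk, Wv):
    """Hypothetical scene-conditioning layer: motion latents (T, d) act as
    queries over scene geometry features (N, d); a residual connection keeps
    the output close to the original latent manifold."""
    Q = motion_latents @ Wq                                   # (T, d)
    K = scene_feats @ Wk                                      # (N, d)
    V = scene_feats @ Wv                                      # (N, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (T, N)
    return motion_latents + attn @ V                          # residual update

# Toy example: 8 motion-latent frames attending over 32 scene points.
rng = np.random.default_rng(0)
T, N, d = 8, 32, 16
z = rng.standard_normal((T, d))      # motion latents
s = rng.standard_normal((N, d))      # scene point features
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = scene_cross_attention(z, s, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

The residual formulation mirrors the stated design goal of injecting scene geometry without destroying the pretrained text-to-motion latent structure.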