🤖 AI Summary
Current text-driven motion generation methods struggle to capture motion semantics and model scene interactions at the same time, primarily because there are no large-scale datasets jointly annotated with both rich text–motion alignments and precise scene geometry. To address this, SceneAdapt is a two-stage scene-aware adaptation framework: (1) a motion-inbetweening proxy task bridges disjoint text–motion and scene–motion data; (2) in the latent space, learnable keyframing layers and a cross-attention scene-conditioning layer enable adaptive fusion of semantics and geometry. The method jointly models these multi-source signals without requiring joint annotations, significantly improves the scene consistency of generated motions, and sheds light on the mechanisms through which scene awareness emerges.
📝 Abstract
Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene awareness in isolation, since constructing large-scale datasets with both rich text–motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene–motion and text–motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task that bridges the two distinct datasets and thereby injects scene awareness into text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
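The abstract's second stage, where motion latents adaptively query local scene context through cross-attention, can be illustrated with a minimal NumPy sketch. All names, shapes, and the single-head formulation here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scene_cross_attention(motion_latents, scene_feats, Wq, Wk, Wv):
    """Hypothetical scene-conditioning layer: motion latents (T, d) act as
    queries over scene geometry features (N, d); a residual connection keeps
    the output close to the original latent manifold."""
    Q = motion_latents @ Wq                                   # (T, d)
    K = scene_feats @ Wk                                      # (N, d)
    V = scene_feats @ Wv                                      # (N, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (T, N)
    return motion_latents + attn @ V                          # residual update

# Toy example: 8 motion-latent frames attending over 32 scene points.
rng = np.random.default_rng(0)
T, N, d = 8, 32, 16
z = rng.standard_normal((T, d))      # motion latents
s = rng.standard_normal((N, d))      # scene point features
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = scene_cross_attention(z, s, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

The residual formulation mirrors the stated design goal of injecting scene geometry without destroying the pretrained text-to-motion latent structure.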