🤖 AI Summary
Stylized Human-Scene Interaction (HSI) simulation suffers from limited motion diversity, weak stylistic expressiveness, and physically implausible behaviors. To address these challenges, we propose SIMS, a hierarchical framework: its upper layer employs Retrieval-Augmented Generation (RAG) coupled with large language models to generate long-horizon, multi-style semantic scripts; its lower layer introduces a multimodal physical controller integrating textual embeddings, geometry-aware constraints, and task objectives, synergizing diffusion-based motion synthesis with reinforcement learning. Key contributions include: (1) the first script-driven architecture unifying high-level intent planning and low-level physics control; (2) the first diverse, stylized HSI motion dataset and a RAG-generated planning dataset; and (3) an end-to-end mapping from text-based style prompts to physically plausible motion. Experiments demonstrate that SIMS significantly outperforms prior methods in cross-scene generalization, motion naturalness, stylistic richness, and adherence to physical constraints.
📝 Abstract
Simulating stylized human-scene interactions (HSI) in physical environments is a challenging yet fascinating task. Prior works emphasize long-term execution but fall short of achieving both stylistic diversity and physical plausibility. To tackle this challenge, we introduce a novel hierarchical framework named SIMS that bridges high-level script-driven intent with a low-level control policy, enabling more expressive and diverse human-scene interactions. Specifically, we employ Large Language Models with Retrieval-Augmented Generation (RAG) to generate coherent and diverse long-form scripts, providing a rich foundation for motion planning. We also develop a versatile multi-condition physics-based control policy, which leverages text embeddings from the generated scripts to encode stylistic cues while simultaneously perceiving environmental geometry and accomplishing task goals. By integrating the retrieval-augmented script generation with the multi-condition controller, our approach provides a unified solution for generating stylized HSI motions. We further introduce a comprehensive planning dataset produced by RAG and a stylized motion dataset featuring diverse locomotion and interaction motions. Extensive experiments demonstrate SIMS's effectiveness in executing various tasks and generalizing across different scenarios, significantly outperforming previous methods.
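As a rough illustration of the retrieval-augmented script generation step described above, the sketch below retrieves the reference snippets most similar to a request and assembles them into a prompt for a language model. It is not the SIMS implementation: the corpus, the bag-of-words similarity (standing in for a learned text encoder), the prompt layout, and the names `CORPUS`, `retrieve`, and `build_prompt` are all assumptions made for this example, and the call to the language model itself is omitted.

```python
import math
from collections import Counter

# Hypothetical reference snippets standing in for a retrieval corpus.
CORPUS = [
    "An old man walks slowly to the sofa and sits down heavily.",
    "A child runs happily across the room and jumps onto the bed.",
    "A tired worker drags their feet to the chair and slumps into it.",
]


def bow(text):
    """Lowercased bag-of-words counts (assumed stand-in for a text encoder)."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k corpus entries most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bow(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, corpus, k=2):
    """Assemble retrieved examples and the request into one prompt string."""
    lines = ["Reference scripts:"]
    lines += [f"- {example}" for example in retrieve(query, corpus, k)]
    lines.append(f"Task: write a step-by-step script for: {query}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("a tired person sits on the chair", CORPUS))
```

In a full pipeline the resulting prompt would be passed to a language model, and the generated script would then be embedded and handed to the low-level controller as its style condition.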