SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation

📅 2024-11-29
📈 Citations: 1
Influential: 0

🤖 AI Summary
Stylized Human-Scene Interaction (HSI) simulation suffers from limited motion diversity, weak stylistic expressiveness, and physically implausible behaviors. To address these challenges, we propose SIMS, a hierarchical framework: its upper layer employs Retrieval-Augmented Generation (RAG) coupled with large language models to generate long-horizon, multi-style semantic scripts; its lower layer introduces a multimodal physical controller integrating textual embeddings, geometry-aware constraints, and task objectives, synergizing diffusion-based motion synthesis with reinforcement learning. Key contributions include: (1) the first script-driven architecture unifying high-level intent planning and low-level physics control; (2) the first diverse, stylized HSI motion dataset and a RAG-generated planning dataset; and (3) an end-to-end mapping from text-based style prompts to physically plausible motion. Experiments demonstrate that SIMS significantly outperforms prior methods in cross-scene generalization, motion naturalness, stylistic richness, and adherence to physical constraints.

📝 Abstract
Simulating stylized human-scene interactions (HSI) in physical environments is a challenging yet fascinating task. Prior works emphasize long-term execution but fall short in achieving both diverse style and physical plausibility. To tackle this challenge, we introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy, enabling more expressive and diverse human-scene interactions. Specifically, we employ Large Language Models with Retrieval-Augmented Generation (RAG) to generate coherent and diverse long-form scripts, providing a rich foundation for motion planning. A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues while simultaneously perceiving environmental geometries and accomplishing task goals. By integrating the retrieval-augmented script generation with the multi-condition controller, our approach provides a unified solution for generating stylized HSI motions. We further introduce a comprehensive planning dataset produced by RAG and a stylized motion dataset featuring diverse locomotions and interactions. Extensive experiments demonstrate SIMS's effectiveness in executing various tasks and generalizing across different scenarios, significantly outperforming previous methods.
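
The retrieval-augmented step described in the abstract can be illustrated with a toy sketch. This is not the paper's pipeline: the corpus, the bag-of-words similarity, and the names `retrieve` and `build_prompt` are all hypothetical, and the final call to a language model is left out. It only shows the general idea of fetching similar reference scripts and placing them in the prompt before generation.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a learned text embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return the k corpus entries most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bag_of_words(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, corpus, k=2):
    """Assemble a prompt that shows retrieved scripts before the request."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus, k))
    return (
        f"Reference scripts:\n{context}\n\n"
        f"Write a stylized interaction script for: {query}"
    )


# Hypothetical reference scripts (not taken from the paper's dataset).
CORPUS = [
    "a tired person walks slowly to the sofa and sits down",
    "an excited child runs to the chair and jumps onto it",
    "chef chops vegetables in the kitchen",
]

prompt = build_prompt("a tired person sits on a chair", CORPUS, k=2)
print(prompt)
```

In a real system the word-count similarity would presumably be replaced by a dense embedding model and the prompt would be sent to an LLM; both choices are assumptions here, not details confirmed by the abstract.
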
Problem

Research questions and friction points this paper is trying to address.

Simulating diverse and physically plausible human-scene interactions.
Bridging high-level script intent with low-level control policy.
Generating stylized motions using retrieval-augmented script generation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework integrating script and control.
Retrieval-Augmented Generation for diverse script creation.
Multi-condition physics-based control for stylistic motion.
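
As a rough illustration of what "multi-condition" could mean for a control policy, the sketch below concatenates a style embedding, scene-geometry samples, and a task goal with the character's own state into a single observation vector. The field names, dimensions, and ordering are assumptions made for this example; the summary above does not specify the actual observation layout or network used by SIMS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conditions:
    """Hypothetical conditioning signals for one control step."""
    style_embedding: List[float]  # text embedding of the current script line
    scene_heights: List[float]    # heights sampled around the character
    goal_offset: List[float]      # target position relative to the character


def build_observation(proprio: List[float], cond: Conditions) -> List[float]:
    """Flatten character state and all conditions into one policy input."""
    return (
        list(proprio)
        + cond.style_embedding
        + cond.scene_heights
        + cond.goal_offset
    )


# Toy values only; real embeddings and height maps would be much larger.
cond = Conditions(
    style_embedding=[0.1, -0.3],
    scene_heights=[0.0, 0.0, 0.45, 0.45],
    goal_offset=[1.2, 0.0, 0.4],
)
obs = build_observation([0.0, 1.0, 0.0], cond)
print(len(obs))
```

Plain concatenation is the simplest way to combine conditions; a learned encoder per modality is another common option, and which one the authors use is not stated here.
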