TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods rely on separate, task-specific controllers, which limits their ability to model complex human-scene interactions (HSIs) that require coordinating multiple skills, e.g., "holding an object while sitting down." Method: TokenHSI is a single, unified transformer-based policy that models humanoid proprioception as a shared token and combines it with distinct task tokens via a masking mechanism, enabling knowledge sharing across skills during multi-task reinforcement learning. The architecture supports variable-length inputs, and training additional task tokenizers adapts learned skills to new scenarios. Contribution/Results: The approach overcomes the limitations of conventional single-task controller designs, improving versatility, adaptability, and extensibility across diverse HSI tasks; it further supports modifying the geometries of interaction targets and coordinating multiple skills within one policy to solve complex composite tasks.

📝 Abstract
Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/
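The abstract's key insight, treating proprioception as a shared token and appending maskable, per-task tokens to form a variable-length transformer input, can be illustrated with a minimal sketch. All names below (`TokenizedInput`, `build_token_sequence`) are hypothetical, and the paper's actual policy feeds such a sequence to a transformer; this only shows the tokenization and masking step.

```python
# Illustrative sketch of TokenHSI-style task tokenization (assumed names,
# not the paper's implementation). The shared proprioception token is
# always present; task tokens are included only when their mask is set,
# yielding a variable-length input sequence.

from dataclasses import dataclass
from typing import List

@dataclass
class TokenizedInput:
    proprioception: List[float]        # shared token, always present
    task_tokens: List[List[float]]     # one embedding per candidate task
    task_mask: List[bool]              # True = task active this step

def build_token_sequence(x: TokenizedInput) -> List[List[float]]:
    """Concatenate the shared proprioception token with the task tokens
    selected by the mask; the result would be fed to a transformer policy."""
    seq = [x.proprioception]
    for tok, active in zip(x.task_tokens, x.task_mask):
        if active:
            seq.append(tok)
    return seq

# Example: two candidate tasks ("sit", "carry"); only "carry" is active.
inp = TokenizedInput(
    proprioception=[0.1, 0.2],
    task_tokens=[[1.0, 0.0], [0.0, 1.0]],
    task_mask=[False, True],
)
seq = build_token_sequence(inp)
# seq == [[0.1, 0.2], [0.0, 1.0]]
```

Because the sequence length varies with the active mask, new task tokenizers can be added without changing the policy backbone, which is what enables the flexible skill composition described above.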
Problem

Research questions and friction points this paper is trying to address.

Unifying diverse Human-Scene Interactions (HSI) for animation and AI
Overcoming limitations of separate controllers for multi-skill HSI tasks
Enabling flexible adaptation to new scenarios with variable inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified transformer-based policy for multi-skill HSI
Task tokenization with masking mechanism
Variable length inputs for flexible adaptation
Authors
Liang Pan, Shanghai AI Laboratory
Zeshi Yang, Independent Researcher (character animation, deep reinforcement learning, Bayesian optimization, motion planning)
Zhiyang Dou, The University of Hong Kong
Wenjia Wang, The University of Hong Kong
Buzhen Huang, Southeast University (computer vision, computer graphics)
Bo Dai, The University of Hong Kong, Feeling AI
Taku Komura, The University of Hong Kong (character animation, computer graphics, robotics)
Jingbo Wang, Shanghai AI Laboratory