TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods rely on separate, task-specific controllers, which limits their ability to model complex human-scene interactions (HSIs) that require coordinating multiple skills, e.g., "holding an object while sitting down." Method: TokenHSI is a single, unified transformer-based policy that models humanoid proprioception as a shared token and combines it with distinct task tokens via a masking mechanism, enabling knowledge sharing across skills during multi-task reinforcement learning. The architecture supports variable-length inputs, and training additional task tokenizers adapts learned skills to new scenarios. Contribution/Results: The approach overcomes the limitations of conventional single-task controller designs, improving versatility, adaptability, and extensibility across diverse HSI tasks; it further supports modifying the geometries of interaction targets and coordinating multiple skills within one policy to solve complex composite tasks.

📝 Abstract
Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/
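The abstract's key insight, treating proprioception as a shared token and appending maskable, per-task tokens to form a variable-length transformer input, can be illustrated with a minimal sketch. All names below (`TokenizedInput`, `build_token_sequence`) are hypothetical, and the paper's actual policy feeds such a sequence to a transformer; this only shows the tokenization and masking step.

```python
# Illustrative sketch of TokenHSI-style task tokenization (assumed names,
# not the paper's implementation). The shared proprioception token is
# always present; task tokens are included only when their mask is set,
# yielding a variable-length input sequence.

from dataclasses import dataclass
from typing import List

@dataclass
class TokenizedInput:
    proprioception: List[float]        # shared token, always present
    task_tokens: List[List[float]]     # one embedding per candidate task
    task_mask: List[bool]              # True = task active this step

def build_token_sequence(x: TokenizedInput) -> List[List[float]]:
    """Concatenate the shared proprioception token with the task tokens
    selected by the mask; the result would be fed to a transformer policy."""
    seq = [x.proprioception]
    for tok, active in zip(x.task_tokens, x.task_mask):
        if active:
            seq.append(tok)
    return seq

# Example: two candidate tasks ("sit", "carry"); only "carry" is active.
inp = TokenizedInput(
    proprioception=[0.1, 0.2],
    task_tokens=[[1.0, 0.0], [0.0, 1.0]],
    task_mask=[False, True],
)
seq = build_token_sequence(inp)
# seq == [[0.1, 0.2], [0.0, 1.0]]
```

Because the sequence length varies with the active mask, new task tokenizers can be added without changing the policy backbone, which is what enables the flexible skill composition described above.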
Problem

Research questions and friction points this paper is trying to address.

Unifying diverse Human-Scene Interactions (HSI) for animation and AI
Overcoming limitations of separate controllers for multi-skill HSI tasks
Enabling flexible adaptation to new scenarios with variable inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified transformer-based policy for multi-skill HSI
Task tokenization with masking mechanism
Variable length inputs for flexible adaptation
Authors
Liang Pan, Shanghai AI Laboratory
Zeshi Yang, Independent Researcher (character animation, deep reinforcement learning, Bayesian optimization, motion planning)
Zhiyang Dou, The University of Hong Kong
Wenjia Wang, The University of Hong Kong
Buzhen Huang, Southeast University (computer vision, computer graphics)
Bo Dai, The University of Hong Kong, Feeling AI
Taku Komura, The University of Hong Kong (character animation, computer graphics, robotics)
Jingbo Wang, Shanghai AI Laboratory