Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-guided human-object interaction (HOI) generation methods typically encode the human body as a single token, which fails to capture fine-grained joint-level interactions and produces unrealistic results; conversely, treating every joint as a token multiplies the token count by more than twenty and raises computational cost. To address this, we propose the Efficient Explicit Joint-level Interaction Model (EJIM), a framework for efficient, explicit joint-level interaction modeling. It combines a Dual-branch HOI Mamba that models spatiotemporal HOI information, a Dual-branch Condition Injector that integrates text semantics and object geometry, and a Dynamic Interaction Block with progressive masking that iteratively filters out irrelevant joints. Experiments on public datasets show that EJIM surpasses previous methods by a large margin while using only 5% of their inference time (roughly a 20× speedup).

📝 Abstract
We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available at https://github.com/Huanggh531/EJIM.
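To make the token-count argument concrete, the following back-of-the-envelope sketch compares how cost grows when the human is split from one token per frame into one token per joint. The joint count of 22 and the frame count of 196 are illustrative assumptions, not figures from the paper, and `relative_cost` is a hypothetical helper. The only modeling facts used are that pairwise self-attention scales quadratically with sequence length, while a state-space model such as Mamba scales linearly.

```python
def token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total human tokens in a motion sequence."""
    return num_frames * tokens_per_frame


def relative_cost(num_frames: int, num_joints: int, scaling: str) -> float:
    """Cost of joint-level tokens relative to one body token per frame.

    scaling="quadratic" mimics pairwise self-attention over all tokens;
    scaling="linear" mimics a linear-time sequence model.
    """
    body_tokens = token_count(num_frames, 1)
    joint_tokens = token_count(num_frames, num_joints)
    if scaling == "quadratic":
        return (joint_tokens ** 2) / (body_tokens ** 2)
    if scaling == "linear":
        return joint_tokens / body_tokens
    raise ValueError(f"unknown scaling: {scaling!r}")


# Assumed values for illustration only: 22 joints, 196 frames.
linear_ratio = relative_cost(196, 22, "linear")
quadratic_ratio = relative_cost(196, 22, "quadratic")
print(linear_ratio, quadratic_ratio)
```

Under these assumptions the joint-level sequence is 22 times longer, so a linear-time model pays about 22 times more, while a quadratic one pays about 484 times more. This gap is one plausible reading of why a linear-time backbone is attractive here.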
Problem

Research questions and friction points this paper is trying to address.

Achieves explicit joint-level modeling for realistic HOIs
Reduces computational overhead in joint-level interaction modeling
Integrates text semantics and object geometry into motions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch HOI Mamba for efficient modeling
Dynamic Interaction Block filters irrelevant joints
Progressive masking ensures accurate interaction modeling
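The idea of progressively filtering out irrelevant joints can be illustrated with a toy sketch. The summary does not specify how relevance is scored or how the mask is scheduled, so everything below is an assumption: relevance is taken to be Euclidean distance from each joint to a single object point, and the hypothetical `keep_schedule` gives the number of joints kept at each stage. It is not the authors' Dynamic Interaction Block.

```python
import math


def joint_object_distances(joints, obj_center):
    """Euclidean distance from each joint position to one object point."""
    return [math.dist(joint, obj_center) for joint in joints]


def progressive_masks(distances, keep_schedule):
    """Return one boolean mask per stage.

    At each stage, only the k closest joints among those still active
    are kept, so the active set can only shrink from stage to stage.
    """
    active = list(range(len(distances)))
    masks = []
    for k in keep_schedule:
        active = sorted(active, key=lambda i: distances[i])[:k]
        kept = set(active)
        masks.append([i in kept for i in range(len(distances))])
    return masks


# Toy example: four joints on a line, object located at the last joint.
joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
distances = joint_object_distances(joints, (3.0, 0.0, 0.0))
masks = progressive_masks(distances, keep_schedule=[3, 2, 1])
for stage, mask in enumerate(masks, start=1):
    print(stage, mask)
```

In this toy setup the farthest joint is dropped first and only the joint touching the object survives the final stage. A learned model would presumably replace the distance heuristic with learned relevance scores.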
Guohong Huang
Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Ling-An Zeng
Sun Yat-sen University
Computer Vision
Zexin Zheng
Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Shengbo Gu
Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Wei-Shi Zheng
Professor, Sun Yat-sen University
Computer Vision, Pattern Recognition, Machine Learning