🤖 AI Summary
Existing human-object interaction (HOI) perception and generation methods operate in an offline setting that assumes access to the full sequence, which limits their applicability to real-world online scenarios where only current and historical observations are available. This work introduces two new tasks, online HOI perception and online HOI generation, designed to model temporal dependencies and state evolution in streaming human-object interactions. To address them, we propose a memory-augmented framework built on the Mamba architecture, using its linear-complexity state-space modeling to encode historical interactions efficiently and update latent states in real time. Evaluated on Core4D and OAKINK2 (online generation) and HOI4D (online perception), our method significantly outperforms offline baselines and achieves state-of-the-art performance, demonstrating both the feasibility and the effectiveness of online HOI modeling.
📝 Abstract
The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model these tasks in an offline setting, where the information at each time step can be drawn from the entire interaction sequence. In real-world scenarios, by contrast, the information available at each time step comes only from the current moment and from historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address these tasks, we introduce the OnlineHOI framework, a Mamba-based network architecture that employs a memory mechanism. By leveraging Mamba's powerful modeling capabilities for streaming data and the memory mechanism's efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.
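The online constraint described above can be illustrated with a toy sketch. It is not the OnlineHOI architecture: the scalar recurrence, the `StreamingSSM` class, the `run_online` helper, and all parameter values are hypothetical, chosen only to show how a state-space style update with a small history buffer produces each output from the current frame and past frames alone.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingSSM:
    """Toy scalar linear state-space recurrence (illustrative only, not OnlineHOI)."""

    a: float = 0.9          # state decay (hypothetical value)
    b: float = 0.1          # input gain (hypothetical value)
    memory_size: int = 4    # hypothetical length of the history buffer
    h: float = 0.0          # latent state carried across time steps
    memory: deque = field(default_factory=deque)

    def step(self, x: float) -> float:
        # Constant-time update: the new state depends only on the
        # previous state and the current input, never on future frames.
        self.h = self.a * self.h + self.b * x
        # Keep a bounded buffer of recent states as a stand-in for "memory".
        self.memory.append(self.h)
        if len(self.memory) > self.memory_size:
            self.memory.popleft()
        # Output mixes the current state with the mean of remembered states.
        return 0.5 * self.h + 0.5 * sum(self.memory) / len(self.memory)


def run_online(frames):
    """Process frames one at a time, as a stream would deliver them."""
    model = StreamingSSM()
    return [model.step(x) for x in frames]


if __name__ == "__main__":
    full = run_online([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    prefix = run_online([1.0, 2.0, 3.0])
    # Outputs for the first three frames do not change when later frames arrive.
    print(full[:3] == prefix)
```

In this sketch, running the model on a prefix of the stream gives the same outputs as the corresponding prefix of a longer run, which is the property that separates an online model from an offline one that can look at the whole sequence.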