🤖 AI Summary
Existing methods predict human-object interactions only in the 2D space of video frames, yielding physically ungrounded predictions and failing to jointly model the three core aspects of an interaction: *what* object, *where* in 3D space, and *how* it is performed. This work introduces the first video-driven 4D future interaction prediction framework, simultaneously forecasting the interacting object, its 3D spatial location, and the actor's physically grounded pose (e.g., bending, pulling). Methodologically, it proposes an environment-action cross-modal fusion mechanism that unifies spatial localization and action generation, combining a multi-view aligned video encoder, a 3D scene geometry-aware module, an autoregressive interaction trajectory decoder, and a motion-prior-guided pose generation network. Evaluated on the Ego-Exo4D dataset, the approach improves both interaction-object localization accuracy and pose physical plausibility, surpassing state-of-the-art methods by more than 30% relative.
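To make the fusion pipeline concrete, here is a minimal PyTorch sketch of the high-level forward pass the summary describes: video action tokens cross-attend to 3D environment tokens, and a single head predicts the object ("what"), a 3D location ("where"), and a pose sequence ("how"). All module names, dimensions, and the GRU rollout are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse past-video action features with 3D environment features via cross-attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, env_tokens):
        # Video tokens (queries) attend to scene-geometry tokens (keys/values).
        fused, _ = self.attn(video_tokens, env_tokens, env_tokens)
        return self.norm(video_tokens + fused)

class FutureInteractionHead(nn.Module):
    """Predict interacting-object class, 3D location, and a future pose sequence."""
    def __init__(self, dim=256, num_objects=50, pose_dim=24 * 3, horizon=16):
        super().__init__()
        self.obj_head = nn.Linear(dim, num_objects)  # 'what': object logits
        self.loc_head = nn.Linear(dim, 3)            # 'where': xyz in scene frame
        self.pose_decoder = nn.GRU(dim, dim, batch_first=True)
        self.pose_out = nn.Linear(dim, pose_dim)     # 'how': per-step joint coordinates
        self.horizon = horizon

    def forward(self, fused):
        ctx = fused.mean(dim=1)  # pooled context token over the fused sequence
        obj_logits = self.obj_head(ctx)
        location = self.loc_head(ctx)
        # GRU rollout over the horizon; a stand-in for the paper's
        # autoregressive trajectory decoder, which feeds outputs back in.
        seq = ctx.unsqueeze(1).expand(-1, self.horizon, -1).contiguous()
        hidden, _ = self.pose_decoder(seq)
        poses = self.pose_out(hidden)
        return obj_logits, location, poses

# Toy usage with random features standing in for real encoder outputs.
video_tokens = torch.randn(2, 32, 256)  # past-video action features
env_tokens = torch.randn(2, 128, 256)   # 3D scene geometry features
fused = CrossModalFusion()(video_tokens, env_tokens)
obj_logits, location, poses = FutureInteractionHead()(fused)
print(obj_logits.shape, location.shape, poses.shape)  # (2, 50) (2, 3) (2, 16, 72)
```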
📝 Abstract
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'where' and 'how'. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model FIction that fuses the past video observation of the person's actions and their environment to predict both the 'where' and 'how' of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially, with more than 30% relative gains.
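The task interface the abstract describes can be summarized as a simple data structure: each prediction bundles the 'what', 'where', and 'how'. The sketch below shows one plausible output format and a distance-threshold check for localization; the field names, the `(T, J, 3)` pose layout, and the 0.5 m radius are assumptions for illustration, not the paper's actual schema or evaluation metric.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FutureInteraction:
    """One predicted 4D future interaction (fields are illustrative)."""
    object_label: str          # 'what', e.g. "cabinet", "fridge"
    location_xyz: np.ndarray   # 'where': 3D point in the scene coordinate frame
    pose_sequence: np.ndarray  # 'how': (T, J, 3) body-joint positions over time

def localization_hit(pred: FutureInteraction, gt_xyz: np.ndarray,
                     radius: float = 0.5) -> bool:
    """Count a prediction as correct if it lands within `radius` meters of the
    ground-truth interaction point (a hypothetical metric, not the paper's)."""
    return float(np.linalg.norm(pred.location_xyz - gt_xyz)) <= radius

pred = FutureInteraction("cabinet",
                         np.array([1.2, 0.0, 3.4]),
                         np.zeros((16, 24, 3)))
print(localization_hit(pred, np.array([1.0, 0.1, 3.3])))  # True
```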