FIction: 4D Future Interaction Prediction from Video

📅 2024-12-01
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing methods predict human-object interactions solely in 2D video frames, failing to ensure physical plausibility or to jointly model the three core aspects of an interaction: *with whom*, *where*, and *how* it occurs. This work introduces the first video-driven 4D future interaction prediction framework, simultaneously forecasting the interacting object, its 3D spatial location, and the actor's physically grounded pose (e.g., bending, pulling). Methodologically, it proposes an environment-action cross-modal fusion mechanism that unifies spatial localization and action generation, and integrates a multi-view aligned video encoder, a 3D scene geometry-aware module, an autoregressive interaction trajectory decoder, and a motion-prior-guided pose generation network. Evaluated on the Ego-Exo4D dataset, the approach achieves substantial gains in interaction object localization accuracy and pose physical plausibility, surpassing state-of-the-art methods by over 30% overall.
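The summary above describes a model whose output couples three quantities: the interacting object, its 3D location, and a pose trajectory. A minimal sketch of what such a joint prediction might look like as a data structure is below; all names, shapes, and the placeholder logic are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical interface for a 4D interaction prediction; the field names
# and the stand-in logic are assumptions for illustration only.

@dataclass
class InteractionPrediction:
    object_label: str                   # *with whom*: the interacting object (e.g., "cabinet")
    location_3d: Tuple[float, float, float]  # *where*: predicted 3D location in the scene
    pose_sequence: List[List[float]]    # *how*: body pose parameters per future frame

def predict_future_interaction(video_features, scene_geometry) -> InteractionPrediction:
    """Toy stand-in for the fused prediction step: a real model would fuse
    video and 3D scene features, then decode the object, its location,
    and a physically grounded pose trajectory (autoregressively, per the
    summary). Here we just return a fixed example prediction."""
    return InteractionPrediction(
        object_label="cabinet",
        location_3d=(1.2, 0.0, 0.8),
        pose_sequence=[[0.0, 0.0, 0.0]],  # one dummy pose frame
    )

pred = predict_future_interaction(video_features=None, scene_geometry=None)
print(pred.object_label, pred.location_3d)
```

The point of the structure is that the three aspects are predicted jointly rather than as separate 2D outputs, which is what the summary contrasts against prior frame-level methods.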

📝 Abstract
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'where' and 'how'. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model FIction that fuses the past video observation of the person's actions and their environment to predict both the 'where' and 'how' of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially, with more than 30% relative gains.
Problem

Research questions and friction points this paper is trying to address.

Predict 3D locations of future human-object interactions
Estimate execution poses for upcoming interactions
Overcome limitations of 2D-only interaction prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D future interaction prediction from videos
Fuses past video and environment observations
Predicts object locations and interaction poses