FORESCENE: FOREcasting human activity via latent SCENE graphs diffusion

📅 2025-03-08

🤖 AI Summary

Existing scene graph (SG) prediction methods assume a fixed graph topology, which limits their capacity for long-horizon modeling of highly dynamic human-object interactions, where objects frequently appear and disappear during daily activities. To address this, we propose the first latent diffusion model (LDM)-based framework for SG sequence prediction. It jointly generates nodes (objects) and edges (relationships) and lets the graph topology change over time through structural insertion and deletion, so no predefined topology is required. Our method combines a graph autoencoder (GAE) with an LDM to model interactions continuously without prior assumptions about the graph's content or structure. Evaluated on the Action Genome dataset, our approach significantly outperforms state-of-the-art methods and, for the first time under open-structure assumptions, achieves fine-grained human-object interaction prediction over 10-second horizons.

📝 Abstract

Forecasting human-environment interactions in daily activities is challenging due to the high variability of human behavior. While predicting directly from videos is possible, it is limited by confounding factors like irrelevant objects or background noise that do not contribute to the interaction. A promising alternative is using Scene Graphs (SGs) to track only the relevant elements. However, current methods for forecasting future SGs face significant challenges and often rely on unrealistic assumptions, such as fixed objects over time, limiting their applicability to long-term activities where interacted objects may appear or disappear. In this paper, we introduce FORESCENE, a novel framework for Scene Graph Anticipation (SGA) that predicts both object and relationship evolution over time. FORESCENE encodes observed video segments into a latent representation using a tailored Graph Auto-Encoder and forecasts future SGs using a Latent Diffusion Model (LDM). Our approach enables continuous prediction of interaction dynamics without making assumptions on the graph's content or structure. We evaluate FORESCENE on the Action Genome dataset, where it outperforms existing SGA methods while solving a significantly more complex task.
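To make the "objects may appear or disappear" point concrete, here is a minimal sketch of a scene graph sequence whose node set changes between frames. The `Triplet` class, the helper names, and the two example frames are hypothetical illustrations, not the paper's data format or the Action Genome annotation schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One scene graph edge: subject --predicate--> object (hypothetical format)."""
    subject: str
    predicate: str
    obj: str


def nodes(graph):
    """Return the set of object labels mentioned in one scene graph."""
    return {t.subject for t in graph} | {t.obj for t in graph}


def topology_changes(prev_graph, next_graph):
    """Report which objects appear or disappear between two consecutive graphs."""
    before, after = nodes(prev_graph), nodes(next_graph)
    return {"appeared": after - before, "disappeared": before - after}


# Hypothetical two-frame sequence: the cup leaves the interaction, a laptop enters.
frame_0 = {
    Triplet("person", "holding", "cup"),
    Triplet("person", "sitting_on", "chair"),
}
frame_1 = {
    Triplet("person", "looking_at", "laptop"),
    Triplet("person", "sitting_on", "chair"),
}

changes = topology_changes(frame_0, frame_1)
print(changes)
```

A forecaster that assumes fixed objects can only relabel the edges among `person`, `cup`, and `chair`; it has no way to introduce `laptop` or drop `cup`, which is the limitation the abstract describes.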

Problem

Research questions and friction points this paper is trying to address.

Forecasting human-environment interactions in daily activities

Predicting object and relationship evolution over time

Overcoming limitations of fixed objects in scene graph forecasting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Scene Graphs for tracking relevant elements

Employs Graph Auto-Encoder for latent representation

Forecasts with Latent Diffusion Model for dynamics
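As a rough illustration of what diffusion in a latent space involves, the sketch below applies the standard DDPM forward (noising) step to a latent vector, z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε. This is the generic textbook formulation, not necessarily the schedule or parameterization FORESCENE uses; the linear beta schedule, the step count, and the 4-dimensional latent `z0` are all assumptions made for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_latent(z0, t, alpha_bars, rng):
    """Sample z_t from q(z_t | z_0) and return it with the noise that was used."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Hypothetical latent code, standing in for a graph encoder's output.
z0 = [0.5, -1.2, 0.3, 0.8]
zt, eps = noise_latent(z0, 500, alpha_bars, rng)
print(zt)
```

In a latent diffusion setup, a denoising network is trained to predict `eps` from `zt` and `t`; sampling then runs the process in reverse and a decoder maps the resulting latent back to a graph. Those components are omitted here because the summary does not specify them.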
