EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the scarcity of large-scale, densely annotated first-person interaction videos from real-world settings, a scarcity caused by high acquisition costs, privacy concerns, and insufficient coverage of interaction patterns. To overcome these limitations, the authors propose a highly controllable simulator for synthesizing first-person videos, leveraging physics-driven simulation to model camera viewpoints, human body poses, hand motions, object manipulations, and temporal dynamics. The framework supports dense multi-task annotations and is used to generate a synthetic dataset; models trained with this data improve consistently over strong baselines on multiple real-world benchmarks. Notably, models trained exclusively on the synthetic data also perform well, which supports the approach's effectiveness for fine-grained human-object interaction understanding and cross-domain transfer.
πŸ“ Abstract
Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.
Problem

Research questions and friction points this paper is trying to address.

egocentric video
synthetic data
human-object interaction
temporal annotation
interaction anticipation
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic egocentric video
controllable simulation
human-object interaction
temporal dynamics
interaction anticipation