VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses short-term object interaction anticipation in egocentric video: forecasting the future active object's bounding box, noun and verb categories, time-to-contact, and confidence score. The authors propose VISTA, a multi-task framework built on the StillFast architecture that combines object-centric spatial detection with short-horizon temporal context. A Faster R-CNN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and region-of-interest (ROI)-level context fusion. Combined with multi-head prediction and ensembling of complementary predictions, the approach took first place in the EgoVis 2026 Ego4D Short-Term Object Interaction Anticipation (STA) Challenge.
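To make the prediction target concrete, the five outputs listed above can be pictured as one record per anticipated interaction. The sketch below is only an illustration: the class name `STAPrediction`, its field names, the helper `top_k`, the sample values, and the choice of seconds for time-to-contact are all assumptions, not the authors' code or the challenge's submission format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class STAPrediction:
    """One anticipated interaction (hypothetical field names)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) on the last observed frame
    noun: str     # category of the future active object
    verb: str     # category of the anticipated interaction
    ttc: float    # time-to-contact, assumed here to be in seconds
    score: float  # interaction confidence


def top_k(preds: List[STAPrediction], k: int) -> List[STAPrediction]:
    """Return the k most confident predictions, highest score first."""
    return sorted(preds, key=lambda p: p.score, reverse=True)[:k]


# Made-up sample predictions for a single timestamp.
preds = [
    STAPrediction((10.0, 20.0, 110.0, 220.0), "cup", "take", 0.8, 0.35),
    STAPrediction((300.0, 40.0, 420.0, 200.0), "knife", "cut", 1.5, 0.90),
    STAPrediction((50.0, 60.0, 90.0, 120.0), "lid", "open", 0.4, 0.60),
]
best = top_k(preds, 2)
```

Ranking by confidence is shown only because each prediction carries a score; how the actual system filters or ranks its outputs is not described in this summary.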

📝 Abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.
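The abstract names two injection points for the temporal representation, feature modulation and ROI-level context fusion, without giving their exact form. The toy sketch below shows one common reading of each, using plain Python lists so it does not depend on any framework: a FiLM-style per-channel scale and shift for modulation, and concatenation for ROI-level fusion. Both choices, along with the function names `modulate` and `fuse_roi` and all numeric values, are assumptions for illustration and may differ from what VISTA actually does.

```python
from typing import List

Vector = List[float]


def modulate(feature: Vector, gamma: Vector, beta: Vector) -> Vector:
    """FiLM-style modulation (assumed form): per-channel scale and shift.

    In a real model, gamma and beta would be produced from the temporal
    context by learned layers; here they are passed in directly.
    """
    return [f * (1.0 + g) + b for f, g, b in zip(feature, gamma, beta)]


def fuse_roi(roi_feature: Vector, context: Vector) -> Vector:
    """ROI-level fusion (assumed form): append clip context to a proposal feature."""
    return roi_feature + context


# Toy 3-channel detector feature and hand-picked modulation parameters.
feature = [1.0, 2.0, 4.0]
gamma = [0.5, 0.0, -0.5]
beta = [0.0, 1.0, 0.0]
modulated = modulate(feature, gamma, beta)

# Toy proposal feature fused with a 1-dimensional clip-level context vector.
fused = fuse_roi([1.5, 3.0], [0.25])
```

In this reading, the fused vector is what the multi-head predictors would consume, one head each for box refinement, noun, verb, time-to-contact, and confidence.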

Problem

Research questions and friction points this paper is trying to address.

egocentric vision

object interaction anticipation

short-term prediction

human-object interaction

temporal anticipation

Innovation

Methods, ideas, or system contributions that make the work stand out.

V-JEPA

StillFast

egocentric video anticipation

temporal context fusion

object interaction prediction
