VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

📅 2026-05-20

🤖 AI Summary
This work addresses short-term object interaction anticipation in egocentric video: forecasting the next active object's bounding box, noun and verb categories, time-to-contact, and a confidence score. The authors propose VISTA, a multi-task framework built on the StillFast architecture. An object detector runs on the high-resolution last observed frame, while frozen V-JEPA 2.1 temporal representations are injected into the Faster R-CNN detection pipeline through feature modulation and region-of-interest (ROI)-level context fusion, combining object-centric spatial detection with temporal context. With multi-head prediction and model ensembling, the approach took first place in the EgoVis 2026 Ego4D Short-Term Object Interaction Anticipation Challenge.
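To make the prediction target concrete, the following minimal Python sketch shows one way a single anticipated interaction could be represented and ranked by confidence. The class name `STAPrediction`, its field names, the pixel box format, the use of seconds for time-to-contact, and the helper `top_k` are all illustrative assumptions, not taken from the VISTA code or the official challenge format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class STAPrediction:
    """One anticipated interaction (hypothetical layout, not the official format)."""
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels
    noun: str                               # category of the next active object
    verb: str                               # category of the anticipated action
    time_to_contact: float                  # assumed to be in seconds
    score: float                            # interaction confidence


def top_k(predictions: List[STAPrediction], k: int) -> List[STAPrediction]:
    """Return the k most confident predictions, highest score first."""
    return sorted(predictions, key=lambda p: p.score, reverse=True)[:k]


predictions = [
    STAPrediction((10.0, 20.0, 110.0, 140.0), "cup", "take", 0.8, 0.35),
    STAPrediction((200.0, 50.0, 320.0, 180.0), "knife", "take", 1.2, 0.72),
    STAPrediction((15.0, 25.0, 100.0, 130.0), "cup", "move", 0.6, 0.10),
]
best = top_k(predictions, k=2)
```

Each record bundles the five outputs the task asks for, so a ranked list of such records is one plausible shape for a per-timestamp result.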
📝 Abstract
We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.
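The abstract does not specify how feature modulation or ROI-level context fusion is implemented. As a rough illustration only, the sketch below assumes a FiLM-style channel-wise scale and shift for the modulation step and simple concatenation for the ROI-level fusion. The function names `linear`, `modulate`, and `fuse_roi`, the weight layout, and the toy dimensions are hypothetical and are not taken from the VISTA code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Plain affine map: one output per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def modulate(feature_map: Matrix, context: Vector,
             w_gamma: Matrix, b_gamma: Vector,
             w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Assumed FiLM-style modulation of a detector feature map.

    feature_map is C x N (channels by flattened spatial positions).
    The clip-level context vector yields one scale and one shift per channel.
    """
    gamma = linear(context, w_gamma, b_gamma)
    beta = linear(context, w_beta, b_beta)
    # (1 + gamma) keeps the map unchanged when gamma and beta are zero.
    return [[(1.0 + g) * v + b for v in channel]
            for channel, g, b in zip(feature_map, gamma, beta)]


def fuse_roi(roi_features: Matrix, context: Vector) -> Matrix:
    """Assumed ROI-level fusion: append the context to every proposal feature."""
    return [list(roi) + list(context) for roi in roi_features]


# Toy example: 2 channels, 2 spatial positions, 2-dimensional context.
context = [1.0, 2.0]
feature_map = [[1.0, 2.0], [3.0, 4.0]]
modulated = modulate(
    feature_map, context,
    w_gamma=[[0.5, 0.0], [0.0, 0.5]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[1.0, -1.0],
)
fused = fuse_roi([[0.1, 0.2]], context)
```

In a real system the context vector would come from the frozen temporal branch and the weights would be learned; here they are fixed by hand so the arithmetic is easy to follow.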
Problem

Research questions and friction points this paper is trying to address.

egocentric vision
object interaction anticipation
short-term prediction
human-object interaction
temporal anticipation
Innovation

Methods, ideas, or system contributions that make the work stand out.

V-JEPA
StillFast
egocentric video anticipation
temporal context fusion
object interaction prediction