How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

📅 2026-05-19



🤖 AI Summary

This work addresses the high uncertainty of first-person future prediction: the same context admits many plausible futures, so models trained to minimize prediction error hedge or average across them. The authors propose using the future camera trajectory as an implicit signal of operator intent, conditioning action prediction on it within an action-aligned embedding space whose structure is shaped by language, although language is never a conditioning input. They show that camera motion outperforms language as a conditioning signal, and that most of the gain is retained at test time when the trajectory is predicted from context rather than observed. The resulting model, TrajPilot, beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the advantage widening at longer horizons and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

📝 Abstract

Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.
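The averaging problem described in the abstract can be illustrated with a toy example. The sketch below is not the authors' model or data: it assumes a single fixed context with two equally likely futures encoded as scalars near -1 and +1, and a hypothetical binary variable `z` standing in for the future camera trajectory. Under squared error, the best unconditioned prediction is the mean of the two futures, which matches neither; conditioning on `z` lets the predictor commit to one.

```python
import random
from statistics import mean

random.seed(0)

# Toy data (assumed for illustration): one context, two plausible futures.
# z is a hypothetical stand-in for the future camera trajectory;
# y is a scalar encoding of the action that follows.
samples = []
for _ in range(1000):
    z = random.choice([0, 1])
    y = (1.0 if z == 1 else -1.0) + random.gauss(0.0, 0.1)
    samples.append((z, y))


def mse(predict, data):
    """Mean squared error of predict(z) against the observed y."""
    return mean((y - predict(z)) ** 2 for z, y in data)


# Unconditioned: the constant that minimizes squared error is the sample mean,
# which lands between the two futures.
y_bar = mean(y for _, y in samples)
uncond_mse = mse(lambda z: y_bar, samples)

# Conditioned on z: the best prediction is the mean within each group,
# so the predictor commits to one future per trajectory.
cond_means = {k: mean(y for z, y in samples if z == k) for k in (0, 1)}
cond_mse = mse(lambda z: cond_means[z], samples)

print(f"unconditioned prediction: {y_bar:+.3f}, MSE: {uncond_mse:.3f}")
print(f"conditioned predictions: {cond_means}, MSE: {cond_mse:.3f}")
```

In this toy setting the unconditioned predictor sits near zero and pays roughly the full variance of the two-mode distribution, while the conditioned predictor's error falls to about the noise level. The abstract's second finding corresponds to replacing the observed `z` with one predicted from context, which this sketch does not model.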

Problem

Research questions and friction points this paper is trying to address.

egocentric prediction

future trajectory

action anticipation

intent modeling

first-person vision

Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory-conditioned prediction

egocentric vision

action anticipation