🤖 AI Summary
It remains unclear whether the performance degradation observed in egocentric visual object tracking and segmentation stems primarily from inherent egocentric imaging characteristics (e.g., motion blur, field-of-view shifts, self-occlusion) or from task-level challenges common to human-object interaction. Method: To disentangle these factors, the authors introduce Ego-vs-Exo, the first cross-perspective benchmark enabling controlled evaluation by decoupling viewpoint-specific effects from activity semantics, and propose a variable-controlled assessment framework that systematically compares egocentric and exocentric videos under identical task conditions. Contribution/Results: Experiments reveal that roughly 40% of the performance gap arises from domain-general difficulties shared across perspectives, with the remaining disparity attributable to egocentric imaging properties alone. This work provides the first quantitative decomposition of viewpoint effects, establishing an empirically grounded paradigm for algorithm design and benchmark development.
📝 Abstract
Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first-person egocentric vision is more challenging than previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third-person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the first-person viewpoint inherent to egocentric vision, and how much from the domain of human-object activities itself? To address this question, we introduce a new benchmark study designed to disentangle these factors. Our evaluation strategy enables a more precise separation of challenges tied to the first-person perspective from those linked to the broader domain of human-object activity understanding. In doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements on these tasks.