🤖 AI Summary
This work addresses the problem of automatically identifying the first-person camera wearer in third-person videos, a prerequisite for multi-view collaborative understanding. We propose a sequential matching framework that jointly models temporal motion dynamics and performs cross-view person re-identification: optical flow provides discriminative temporal motion representations, while multimodal feature alignment bridges the domain gap between first- and third-person video streams for robust cross-view identity association. To support systematic evaluation, we introduce TF2025, a large-scale multi-view dataset of synchronized, aligned first- and third-person video sequences. On this benchmark, our method achieves 86.7% top-1 identification accuracy, outperforming frame-level baselines by 14.2%. This work establishes a scalable approach and a foundational dataset for binding first-person wearers to their third-person appearances in applications such as immersive learning and collaborative robotics.
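To make the matching idea concrete, below is a minimal sketch of sequence-level wearer identification. It assumes per-frame motion descriptors (e.g., pooled optical flow) and a re-ID appearance embedding have already been extracted for the egocentric stream and for each third-person candidate track; all function names, the fusion weight `alpha`, and the cosine-similarity scoring are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1-D feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_wearer(ego_motion, cand_motion, ego_appear, cand_appear, alpha=0.5):
    """Score each third-person candidate track against the egocentric stream.

    ego_motion  : (T, D) per-frame motion features from the first-person video
                  (e.g. pooled optical-flow descriptors; assumed precomputed)
    cand_motion : list of (T, D) motion features, one per candidate track
    ego_appear  : (E,) appearance embedding of the wearer from a re-ID model
    cand_appear : list of (E,) appearance embeddings, one per candidate
    alpha       : fusion weight between motion and appearance cues (assumed)

    Returns the index of the best-scoring candidate and all fused scores.
    """
    scores = []
    for motion, appear in zip(cand_motion, cand_appear):
        # Sequence-level motion agreement: average per-frame similarity
        # between egocentric motion and the candidate's observed motion.
        motion_score = np.mean(
            [cosine_sim(e, c) for e, c in zip(ego_motion, motion)]
        )
        # Cross-view appearance similarity from the re-ID embeddings.
        appear_score = cosine_sim(ego_appear, appear)
        scores.append(alpha * motion_score + (1 - alpha) * appear_score)
    return int(np.argmax(scores)), scores

# Toy usage with random features standing in for real descriptors.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, E, n_cands = 16, 128, 256, 3
    ego_m = rng.normal(size=(T, D))
    cand_m = [rng.normal(size=(T, D)) for _ in range(n_cands)]
    ego_a = rng.normal(size=E)
    cand_a = [rng.normal(size=E) for _ in range(n_cands)]
    best, scores = match_wearer(ego_m, cand_m, ego_a, cand_a)
    print(f"best candidate: {best}, scores: {np.round(scores, 3)}")
```

Aggregating similarity over the whole sequence, rather than scoring single frames, is what lets motion cues disambiguate candidates whose appearance alone is insufficient.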
📝 Abstract
The increasing popularity of egocentric cameras has generated growing interest in studying multi-camera interactions in shared environments. Although large-scale datasets such as Ego4D and Ego-Exo4D have propelled egocentric vision research, interactions between multiple camera wearers remain underexplored, a key gap for applications like immersive learning and collaborative robotics. To bridge this gap, we present TF2025, an expanded dataset with synchronized first- and third-person views. In addition, we introduce a sequence-based method that identifies first-person camera wearers in third-person footage by combining motion cues with person re-identification.