🤖 AI Summary
This work addresses visuomotor co-modeling from the first-person perspective, aiming to jointly predict head pose, gaze direction, and upper-body motion from egocentric videos and skeletal joint trajectories. We propose the Visuomotor Coordination Representation (VCR), the first diffusion-based framework for multimodal visuomotor temporal forecasting. VCR employs temporally aligned encoding and cross-modal attention to explicitly capture long-range inter-modal dependencies and enable generative prediction. Evaluated on the large-scale, real-world EgoExo4D dataset, our method significantly outperforms unimodal baselines as well as RNN- and Transformer-based approaches. It achieves state-of-the-art performance in both long-horizon prediction accuracy and cross-scenario generalization, demonstrating substantial improvements in modeling complex visuomotor dynamics under naturalistic egocentric settings.
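The sketch below illustrates one way the temporally aligned encoding and cross-modal attention mentioned above could be realized in PyTorch. The module name, feature dimensions, and fusion layout are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): temporally aligned encoding of egocentric
# video features and skeletal joint trajectories, fused with cross-modal attention.
import torch
import torch.nn as nn

class VideoKinematicFusion(nn.Module):
    def __init__(self, video_dim=512, joint_dim=63, d_model=256, n_heads=4):
        super().__init__()
        # Project each modality into a shared embedding space, per time step.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.joint_proj = nn.Linear(joint_dim, d_model)
        # Cross-modal attention: kinematic tokens attend to video tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video_feats, joint_feats):
        # video_feats: (B, T, video_dim), joint_feats: (B, T, joint_dim),
        # assumed to be sampled at the same rate so the T axes are aligned.
        v = self.video_proj(video_feats)
        k = self.joint_proj(joint_feats)
        fused, _ = self.cross_attn(query=k, key=v, value=v)
        return self.norm(k + fused)  # (B, T, d_model) fused visuomotor tokens

# Example: 2-second clips at 30 fps with 21 upper-body joints (x, y, z each).
fusion = VideoKinematicFusion()
video = torch.randn(8, 60, 512)
joints = torch.randn(8, 60, 63)
tokens = fusion(video, joints)       # torch.Size([8, 60, 256])
```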
📝 Abstract
Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a *Visuomotor Coordination Representation* (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.
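For readers unfamiliar with diffusion-based motion forecasting, the hedged sketch below shows a generic DDPM-style sampling loop that generates future head-pose, gaze, and upper-body trajectories conditioned on an encoded visuomotor context. The `denoiser` interface, noise schedule, and output layout are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of the generative prediction step, assuming a DDPM-style sampler.
import torch

@torch.no_grad()
def sample_future(denoiser, context, horizon=30, out_dim=72, steps=50):
    # context: (B, T, d_model) encoded past observations.
    # Output: (B, horizon, out_dim), where out_dim is an assumed concatenation of
    # head pose, gaze direction, and upper-body joint coordinates.
    B = context.shape[0]
    x = torch.randn(B, horizon, out_dim, device=context.device)
    betas = torch.linspace(1e-4, 0.02, steps, device=context.device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(x, t, context)  # network predicts the noise added at step t
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # Re-inject noise on all but the final denoising step.
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```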