🤖 AI Summary
This study investigates the feasibility of video content understanding using only camera motion trajectories—bypassing pixel-level processing. To this end, we propose CamFormer, a model that encodes sequences of camera poses into a joint embedding space aligned with natural language via contrastive learning, enabling cross-modal semantic alignment. Our key contribution is the first systematic validation that camera trajectories intrinsically encode rich semantic information: *how* a camera moves reliably reflects *what* action is occurring or *what* scene is being observed—establishing trajectory as a lightweight, robust, and general-purpose modality for video understanding. CamFormer is agnostic to pose estimation methods and achieves state-of-the-art performance on downstream tasks including cross-modal retrieval, action classification, and temporal reasoning. It demonstrates strong generalization across domains and robustness to variation in how the trajectories are estimated, underscoring the viability of trajectory-centric video analysis.
📝 Abstract
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal for uncovering video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensor rigs and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
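The contrastive trajectory-language alignment described above can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's implementation: the symmetric InfoNCE objective is the standard CLIP-style formulation, and the random linear projection below merely stands in for the actual CamFormer trajectory encoder and the text encoder.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(traj_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (trajectory, caption) embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss pulls
    them together while pushing apart all mismatched pairs in the batch.
    """
    z_t = l2_normalize(traj_emb)
    z_c = l2_normalize(text_emb)
    logits = z_t @ z_c.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # correct match is index i for row i

    def xent(lg):
        # Numerically stable cross-entropy against the diagonal labels.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # Average the trajectory-to-text and text-to-trajectory directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy data: a batch of camera pose sequences (translation + quaternion per step).
rng = np.random.default_rng(0)
B, T, D = 4, 16, 7
poses = rng.normal(size=(B, T, D))
W = rng.normal(size=(T * D, 32))
traj_emb = poses.reshape(B, -1) @ W             # stand-in for the trajectory encoder
text_emb = traj_emb + 0.1 * rng.normal(size=(B, 32))  # pretend well-aligned captions
loss = info_nce(traj_emb, text_emb)
```

Because the toy captions are near-copies of their trajectories, this loss is lower than it would be for mismatched pairs; in training, minimizing it shapes the joint embedding space that the downstream retrieval and classification tasks then reuse.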