Seeing without Pixels: Perception from Camera Trajectories

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the feasibility of video content understanding using only camera motion trajectories—bypassing pixel-level processing. To this end, we propose CamFormer, a model that encodes sequences of camera poses into a joint embedding space aligned with natural language via contrastive learning, enabling cross-modal semantic alignment. Our key contribution is the first systematic validation that camera trajectories intrinsically encode rich semantic information: *how* a camera moves reliably reflects *what* action is occurring or *what* scene is being observed—establishing trajectory as a lightweight, robust, and general-purpose modality for video understanding. CamFormer is agnostic to pose estimation methods and achieves state-of-the-art performance on downstream tasks including cross-modal retrieval, action classification, and temporal reasoning. It demonstrates strong generalization across domains and robustness to modality variations, underscoring the viability of trajectory-centric video analysis.

📝 Abstract
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal for uncovering video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensor and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
Problem

Research questions and friction points this paper is trying to address.

Investigates whether video content can be perceived solely from camera motion trajectories
Proposes a method to align camera pose data with natural language descriptions
Demonstrates camera trajectory as a robust modality for video understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning aligns camera trajectories with language
CamFormer encoder projects pose trajectories into embeddings
Robust across diverse camera pose estimation methods
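The contrastive alignment described above can be sketched as a CLIP-style symmetric InfoNCE objective over trajectory and caption embeddings. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, temperature value, and toy data below are assumptions, and the real CamFormer encoder (a pose-sequence transformer) is abstracted away as precomputed embedding matrices.

```python
import numpy as np

# Hypothetical sketch: symmetric contrastive (InfoNCE) loss aligning
# trajectory embeddings with text embeddings in a joint space.
# Row i of each matrix is assumed to be a matched trajectory/caption pair.

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere (cosine similarity space)."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(traj_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a (batch x batch) similarity matrix.

    Diagonal entries are positive pairs; all off-diagonal entries in the
    same row/column serve as in-batch negatives.
    """
    t = l2_normalize(traj_emb)
    c = l2_normalize(text_emb)
    logits = t @ c.T / temperature          # pairwise cosine similarities
    labels = np.arange(len(logits))         # positives sit on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the trajectory->text and text->trajectory directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
traj = rng.normal(size=(4, 8))
loss_matched = contrastive_loss(traj, traj)                      # aligned pairs
loss_random = contrastive_loss(traj, rng.normal(size=(4, 8)))    # unrelated pairs
```

As a sanity check, perfectly aligned pairs drive the loss toward zero, while unrelated trajectory/text embeddings yield a substantially higher loss; training pulls matched pairs together and pushes mismatched ones apart in the shared space.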