SemanticMoments: Training-Free Motion Similarity via Third Moment Features

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video representation methods rely heavily on static appearance and scene context, and struggle to capture motion at the semantic level. This work proposes a training-free motion representation that takes features from pretrained semantic models and computes their higher-order temporal moments — in particular the third-order moment — to characterize the semantic structure of motion. By introducing higher-order temporal statistics into the semantic feature space, the method disentangles motion from appearance without any additional training. On the newly introduced SimMotion benchmark, the approach significantly outperforms existing methods based on RGB frames, optical flow, and text supervision, demonstrating its effectiveness for motion understanding.

📝 Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
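The paper itself does not include code here, but the core idea — per-dimension higher-order central moments of pretrained frame features, with the temporal mean (the static appearance component) subtracted out — can be sketched as follows. The encoder choice, normalization, and use of only the third moment are assumptions for illustration, not the authors' exact recipe:

```python
import numpy as np

def semantic_moment_descriptor(frame_feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Training-free motion descriptor from per-frame semantic features.

    frame_feats: (T, D) array, one D-dim feature per frame from some
    pretrained image encoder (which encoder to use is an assumption here).
    Returns the L2-normalized per-dimension third central moment over time:
    subtracting the temporal mean discards the static appearance component,
    and the third moment summarizes the asymmetry of each feature's
    temporal variation.
    """
    mean = frame_feats.mean(axis=0, keepdims=True)   # static appearance
    centered = frame_feats - mean                    # temporal variation only
    third = (centered ** 3).mean(axis=0)             # third central moment per dim
    return third / (np.linalg.norm(third) + eps)

def motion_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Cosine similarity between two videos' moment descriptors."""
    return float(semantic_moment_descriptor(feats_a) @ semantic_moment_descriptor(feats_b))
```

Note that because the descriptor is built from the centered features, adding a constant offset to every frame (a pure appearance shift) leaves it unchanged — a minimal illustration of the motion/appearance disentanglement the paper claims.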
Problem

Research questions and friction points this paper is trying to address.

semantic motion
video retrieval
motion dynamics
appearance bias
motion representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SemanticMoments
motion similarity
higher-order moments
training-free
video representation