SemanticMoments: Training-Free Motion Similarity via Third Moment Features

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video representation methods rely heavily on static appearance and scene context, and struggle to capture motion at the semantic level. This work proposes a training-free motion representation that takes features from pretrained semantic models and computes their higher-order temporal moments — in particular the third-order moment — to characterize the semantic structure of motion. By introducing higher-order temporal statistics into the semantic feature space, the method disentangles motion from appearance without any additional training. On the newly introduced SimMotion benchmark, the approach significantly outperforms existing methods based on RGB frames, optical flow, and text supervision, demonstrating its effectiveness for motion understanding.

📝 Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
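The paper itself does not include code here, but the core idea — per-dimension higher-order central moments of pretrained frame features, with the temporal mean (the static appearance component) subtracted out — can be sketched as follows. The encoder choice, normalization, and use of only the third moment are assumptions for illustration, not the authors' exact recipe:

```python
import numpy as np

def semantic_moment_descriptor(frame_feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Training-free motion descriptor from per-frame semantic features.

    frame_feats: (T, D) array, one D-dim feature per frame from some
    pretrained image encoder (which encoder to use is an assumption here).
    Returns the L2-normalized per-dimension third central moment over time:
    subtracting the temporal mean discards the static appearance component,
    and the third moment summarizes the asymmetry of each feature's
    temporal variation.
    """
    mean = frame_feats.mean(axis=0, keepdims=True)   # static appearance
    centered = frame_feats - mean                    # temporal variation only
    third = (centered ** 3).mean(axis=0)             # third central moment per dim
    return third / (np.linalg.norm(third) + eps)

def motion_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Cosine similarity between two videos' moment descriptors."""
    return float(semantic_moment_descriptor(feats_a) @ semantic_moment_descriptor(feats_b))
```

Note that because the descriptor is built from the centered features, adding a constant offset to every frame (a pure appearance shift) leaves it unchanged — a minimal illustration of the motion/appearance disentanglement the paper claims.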
Problem

Research questions and friction points this paper is trying to address.

semantic motion
video retrieval
motion dynamics
appearance bias
motion representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SemanticMoments
motion similarity
higher-order moments
training-free
video representation