MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
For motion-intensive video understanding tasks—including atomic action recognition, atypical behavior detection in autism, and real-time MRI-based articulatory motion analysis—this paper introduces MOOSE, a framework that enables efficient and interpretable temporal modeling without end-to-end video model training. Methodologically, MOOSE leverages frozen pre-trained Vision Transformers (ViTs) and a RAFT optical flow encoder, augmented by a lightweight spatiotemporal feature alignment module. Its key contributions are: (1) the first optical-flow-driven, temporally-centric encoding paradigm; (2) an optical-flow-guided spatial attention mechanism that explicitly fuses flow cues with spatial embeddings; and (3) parameter-efficient reuse of off-the-shelf vision and flow encoders. MOOSE achieves state-of-the-art performance across diverse benchmarks—clinical behavioral analysis, MRI-based speech articulation recognition, and standard action recognition—while offering enhanced temporal modeling efficiency and interpretability.
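The flow-guided spatial attention the summary describes can be sketched in miniature as follows. This is a hedged illustration, not the paper's actual implementation: the function name, tensor shapes, and the flow-magnitude-based weighting scheme are all assumptions; the real model presumably learns the fusion rather than hard-coding it.

```python
import numpy as np

def flow_guided_attention(spatial_emb, flow, tau=1.0):
    """Pool per-patch spatial embeddings, weighted by optical-flow magnitude.

    spatial_emb: (N, D) patch embeddings from a frozen ViT (hypothetical shape)
    flow:        (N, 2) per-patch flow vectors (e.g. pooled from a RAFT field)
    tau:         softmax temperature
    Returns a (D,) motion-weighted summary embedding.
    """
    # Attention logits from flow magnitude: faster-moving patches weigh more.
    mag = np.linalg.norm(flow, axis=1)   # (N,) flow speed per patch
    logits = mag / tau
    logits -= logits.max()               # shift for numerical stability
    w = np.exp(logits)
    w /= w.sum()                         # (N,) attention weights, sum to 1
    return w @ spatial_emb               # (D,) weighted combination

# Toy example: 4 patches with 3-dim embeddings; patch 2 moves the most,
# so its embedding dominates the pooled result.
emb = np.eye(4, 3)
flow = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 4.0], [0.0, 0.2]])
pooled = flow_guided_attention(emb, flow)
```

The point of the sketch is only the fusion direction: motion cues (flow) decide *where* to attend, while the frozen spatial encoder decides *what* is there, so no video backbone needs to be trained end to end.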

📝 Abstract
Many motion-centric video analysis tasks, such as recognizing atomic actions, detecting atypical motor behavior in individuals with autism, or analyzing articulatory motion in real-time MRI of human speech, require efficient and interpretable temporal modeling. Capturing temporal dynamics is a central challenge in video analysis, often requiring significant computational resources and fine-grained annotations that are not widely available. This paper presents MOOSE (Motion Flow Over Spatial Space), a novel temporally-centric video encoder that explicitly integrates optical flow with spatial embeddings to model temporal information efficiently, inspired by human perception of motion. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical-flow encoders instead of training video models from scratch, significantly reducing computational complexity while enhancing temporal interpretability. Our primary contributions include: (1) proposing a computationally efficient, temporally-centric architecture for video understanding; (2) demonstrating enhanced interpretability in modeling temporal dynamics; and (3) achieving state-of-the-art performance on diverse benchmarks, including clinical, medical, and standard action recognition datasets, confirming the broad applicability and effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Efficient temporal modeling for motion-centric video analysis
Reducing computational complexity in video temporal dynamics
Enhancing interpretability of motion in video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates optical flow with spatial embeddings
Uses pre-trained visual and flow encoders
Efficient temporal modeling for video understanding