🤖 AI Summary
This work investigates whether action recognition models can transfer high-level motion concepts (e.g., "hitting") to novel contexts, even within similar distributions. We propose a motion transferability assessment framework and introduce three benchmark datasets—Syn-TA (synthetic 3D object motions), Kinetics400-TA, and Something-Something-v2-TA (both adapted from natural video datasets)—to systematically evaluate 13 state-of-the-art models. Experiments reveal: (1) a significant performance drop when models must recognize high-level actions in unseen variations (e.g., "hitting a person"); (2) multimodal models struggle more with fine-grained unknown actions than with coarse ones; (3) the bias-free Syn-TA proves as challenging as real-world datasets, exposing bottlenecks in fine-grained discrimination and temporal reasoning; and (4) larger models improve transferability when spatial cues dominate but falter under intensive temporal reasoning, while reliance on object and background cues hinders generalization of intrinsic motion. We further show that disentangling coarse- and fine-grained motion representations can improve recognition on temporally challenging datasets. This work establishes a benchmark for robust, motion-centric action understanding and a diagnostic pathway for assessing transferability.
📝 Abstract
Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when they recognize high-level actions in novel contexts. Our analysis reveals: (1) multimodal models struggle more with fine-grained unknown actions than with coarse ones; (2) the bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; (3) larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code are available at: https://github.com/raiyaan-abdullah/Motion-Transfer.