🤖 AI Summary
This work investigates whether action recognition models can transfer high-level motion concepts (e.g., "hitting") to novel contexts, even within similar distributions. We propose a motion transferability assessment framework and introduce three benchmark datasets—Syn-TA (synthetic 3D object motions), Kinetics400-TA, and Something-Something-v2-TA (both adapted from natural video datasets)—to systematically evaluate 13 state-of-the-art models. Experiments reveal: (1) a significant performance drop when models must recognize high-level actions in unseen variations (e.g., "hitting a person"); (2) multimodal models struggle more with fine-grained unknown actions than with coarse ones; (3) the bias-free Syn-TA proves as challenging as real-world datasets, exposing bottlenecks in fine-grained discrimination and temporal reasoning; and (4) larger models improve transferability when spatial cues dominate but falter under intensive temporal reasoning, while reliance on object and background cues hinders generalization of intrinsic motion. We further show that disentangling coarse- and fine-grained motion representations can improve recognition on temporally challenging datasets. This work establishes a benchmark for robust, motion-centric action understanding and a diagnostic pathway for assessing transferability.
📝 Abstract
Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when they recognize high-level actions in novel contexts. Our analysis reveals: (1) multimodal models struggle more with fine-grained unknown actions than with coarse ones; (2) the bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; (3) larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code are available at: https://github.com/raiyaan-abdullah/Motion-Transfer.