Video Action Differencing

📅 2025-03-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper introduces Video Action Differencing (VidDiff), the novel task of localizing and describing subtle differences between two videos of the same action, with applications in coaching and skill assessment. To support the task, the authors construct VidDiffBench, a benchmark of 549 video pairs annotated with 4,469 fine-grained action differences and 2,075 localization timestamps. They propose the VidDiff method, an agentic workflow that decomposes the task into three stages: action difference proposal, keyframe localization, and frame differencing, each handled by a specialized foundation model. Experiments show that VidDiffBench remains challenging for state-of-the-art large multimodal models such as GPT-4o and Qwen2-VL, and that the proposed method improves over them. Both code and dataset are publicly released.

๐Ÿ“ Abstract
How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
Problem

Research questions and friction points this paper is trying to address.

Identify subtle differences between videos of the same action.
Develop a benchmark for the video action differencing task.
Propose a method that localizes relevant sub-actions and compares frames across videos.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VidDiff, the task of video action differencing.
Creates VidDiffBench, a benchmark of expert-annotated video pairs.
Proposes an agentic workflow built on specialized foundation models.
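The three-stage workflow described above (difference proposal, keyframe localization, frame differencing) can be sketched as a simple pipeline. This is a minimal illustration of the structure only: the function names and the stubbed stage logic are hypothetical stand-ins for the specialized foundation models the paper describes, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Difference:
    description: str   # proposed difference, e.g. "arm extension at release"
    timestamp_a: float # keyframe in video A (seconds)
    timestamp_b: float # keyframe in video B (seconds)
    verdict: str       # which video exhibits the difference more: "A" or "B"

def propose_differences(action_name: str) -> list[str]:
    """Stage 1: a language model proposes candidate differences from the
    action description alone. Stubbed here with a fixed list."""
    return ["arm extension at release", "knee bend depth"]

def localize_keyframes(video: str, candidates: list[str]) -> dict[str, float]:
    """Stage 2: a vision-language model maps each candidate difference to the
    sub-action keyframe where it is visible. Stubbed with evenly spaced times."""
    return {c: float(i + 1) for i, c in enumerate(candidates)}

def compare_frames(candidate: str, time_a: float, time_b: float) -> str:
    """Stage 3: a multimodal model compares the two localized frames and
    decides which video shows the difference more. Stubbed deterministically."""
    return "A" if time_a <= time_b else "B"

def viddiff_pipeline(action_name: str, video_a: str, video_b: str) -> list[Difference]:
    """Chain the three stages over a pair of videos of the same action."""
    candidates = propose_differences(action_name)
    times_a = localize_keyframes(video_a, candidates)
    times_b = localize_keyframes(video_b, candidates)
    return [
        Difference(c, times_a[c], times_b[c],
                   compare_frames(c, times_a[c], times_b[c]))
        for c in candidates
    ]
```

Decoupling the stages this way lets each one use the model best suited to it, which is the design rationale the paper gives for the agentic workflow.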