Action Dubber: Timing Audible Actions via Inflectional Flow

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the novel task of *audible action temporal localization*, which aims to identify the precise spatiotemporal coordinates of audible actions (such as collisions) triggered solely by motion discontinuities, using video input only. To address this, we propose *inflectional flow*, a novel optical-flow-based representation that captures abrupt motion changes via the second derivative of motion. We further design TA²Net, the first end-to-end architecture to jointly model temporal and spatial localization by coupling self-supervised contrastive learning with spatial attention, without requiring audio supervision. We establish Audible623, the first dedicated benchmark for this task, and reformulate evaluation protocols on Kinetics and UCF101. Experiments demonstrate significant improvements over baselines in both temporal localization accuracy and sound-source spatial localization. Moreover, our method generalizes strongly to cross-task scenarios, including repetitive action counting.

📝 Abstract
We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization accuracy and simultaneously identifies sound sources within video frames. To support this task, we introduce a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 by removing non-essential vocalization subsets. Extensive experiments confirm the effectiveness of our approach on $Audible623$ and show strong generalizability to other domains, such as repetitive counting and sound source localization. Code and dataset are available at https://github.com/WenlongWan/Audible623.
Problem

Research questions and friction points this paper addresses.

Identify spatio-temporal coordinates of audible movements
Estimate inflectional flow for collision timing without audio
Improve temporal localization and sound source identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates inflectional flow via second motion derivative
Integrates self-supervised spatial localization strategy
Uses contrastive learning with spatial analysis
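The second-motion-derivative idea behind inflectional flow can be sketched with a simple finite-difference approximation over per-frame optical-flow fields. This is an illustrative assumption, not the paper's implementation: the `inflectional_flow` function, the `(T, H, W, 2)` flow layout, and the synthetic motion are all made up here to show how a second temporal difference peaks at a motion discontinuity.

```python
import numpy as np

def inflectional_flow(flow_frames):
    """Approximate inflectional flow as the second temporal difference
    of per-frame optical-flow fields (a sketch, not the paper's estimator).

    flow_frames: array of shape (T, H, W, 2), the (dx, dy) flow per frame.
    Returns a per-pixel score of shape (T-2, H, W).
    """
    f = np.asarray(flow_frames, dtype=np.float64)
    # Second-order difference along time: f[t+1] - 2*f[t] + f[t-1]
    accel = f[2:] - 2.0 * f[1:-1] + f[:-2]
    # Magnitude of the acceleration vector at each pixel
    return np.linalg.norm(accel, axis=-1)

# Synthetic motion with one abrupt reversal: the inflectional-flow score
# should peak at the frames bracketing the reversal.
T, H, W = 9, 4, 4
flow = np.zeros((T, H, W, 2))
flow[:5, ..., 0] = 1.0    # constant rightward motion for frames 0-4
flow[5:, ..., 0] = -1.0   # abrupt reversal from frame 5 onward
score = inflectional_flow(flow).mean(axis=(1, 2))
print(int(np.argmax(score)) + 1)  # → 4, the last frame before the reversal
```

In a real pipeline the flow fields would come from an optical-flow estimator, and TA²Net learns to map such acceleration cues to collision timings rather than thresholding them directly.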
Wenlong Wan
School of Computing and Information Systems, Singapore Management University
Weiying Zheng
School of Computing and Data Science, University of Hong Kong
Tianyi Xiang
PhD, City University of Hong Kong
Computer Vision · Machine Learning
Guiqing Li
School of Computer Science and Engineering, South China University of Technology
Shengfeng He
Singapore Management University
Visual Computing · Generative Models · Computer Vision · Computational Photography · Computer Graphics