Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain

📅 2025-06-22
🤖 AI Summary
This paper addresses weakly supervised temporal action localization, where only video-level annotations are available and no frame-level supervision exists, with a two-stage pseudo-label learning framework. In the first stage, an Initial Label Generation (ILG) module uses appearance and motion streams together with temporal multi-resolution consistency to produce class activation sequences that serve as initial frame-level pseudo labels. In the second stage, a Progressive Temporal Label Refinement (PTLR) framework alternates between two networks, one for the original temporal scale and one for the reduced temporal scales, so that pseudo labels are exchanged across scales and progressively refined. The key contributions are: (1) a multi-resolution temporal consistency-driven module for initial pseudo-label generation; and (2) a progressive pseudo-label optimization framework. Extensive experiments demonstrate significant performance gains over state-of-the-art methods on THUMOS14 and ActivityNet-v1.3, validating both effectiveness and generalizability.

📝 Abstract
Weakly supervised temporal action localization is challenging because only video-level annotations are available during training. To address this problem, we propose a two-stage approach that fully exploits multi-resolution information in the temporal domain and generates high-quality frame-level pseudo labels from both the appearance and motion streams. In the first stage, we generate reliable initial frame-level pseudo labels. In the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks that better predict action class scores at each frame. Throughout, we exploit temporal information at multiple scales to improve temporal action localization performance. To obtain reliable initial frame-level pseudo labels in the first stage, we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high-quality class activation sequences (CASs). A CAS consists of a number of sequences, each measuring how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In PTLR, two networks, Network-OTS and Network-RTS, generate CASs for the original temporal scale and the reduced temporal scales, respectively, and serve as two streams (the OTS stream and the RTS stream) that refine the pseudo labels in turn. In this way, multi-resolution information in the temporal domain is exchanged at the pseudo-label level, and each stream is improved by exploiting the refined pseudo labels from the other stream.
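To make the notion of a class activation sequence more concrete, here is a minimal Python sketch. It is not the paper's ILG module. It only assumes that a CAS can be stored as a frames-by-classes table of scores, that a reduced temporal scale can be imitated by average pooling over time, and that "highly confident" frames can be picked with two thresholds. The function names, the thresholds, and the toy numbers are all hypothetical.

```python
from typing import Dict, List

# cas[t][c]: score that frame t belongs to action class c (assumed layout).
CAS = List[List[float]]


def downsample(cas: CAS, factor: int) -> CAS:
    """Average-pool a CAS over time to imitate a reduced temporal scale."""
    pooled = []
    for start in range(0, len(cas), factor):
        window = cas[start:start + factor]
        num_classes = len(window[0])
        pooled.append([sum(row[c] for row in window) / len(window)
                       for c in range(num_classes)])
    return pooled


def upsample(cas: CAS, factor: int, length: int) -> CAS:
    """Repeat each reduced-scale row so it lines up with the original frames."""
    return [list(cas[min(t // factor, len(cas) - 1)]) for t in range(length)]


def select_confident(cas: CAS, high: float, low: float) -> Dict[int, int]:
    """Keep frames whose top score is clearly high (action) or clearly low.

    Returns a mapping frame index -> class index, with -1 marking background.
    Frames in between the two thresholds are left unlabelled.
    """
    labels = {}
    for t, scores in enumerate(cas):
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= high:
            labels[t] = best   # confident action frame
        elif scores[best] <= low:
            labels[t] = -1     # confident background frame
    return labels


# Toy CAS for 6 frames and 2 classes (made-up numbers).
cas_ots = [[0.9, 0.1], [0.8, 0.1], [0.5, 0.4],
           [0.1, 0.1], [0.1, 0.9], [0.2, 0.7]]
cas_rts = downsample(cas_ots, factor=2)            # 3 rows, one per 2 frames
cas_rts_aligned = upsample(cas_rts, factor=2, length=len(cas_ots))
pseudo_labels = select_confident(cas_ots, high=0.7, low=0.2)
```

In this toy example, frame 2 receives no pseudo label because its top score falls between the two thresholds. This mirrors the idea of training only on a selected set of frames with highly confident pseudo labels.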
Problem

Research questions and friction points this paper is trying to address.

Improving weakly supervised temporal action localization accuracy
Generating high-quality frame-level pseudo labels
Exploiting multi-resolution temporal information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage approach for pseudo label refinement
Multi-resolution temporal information exploitation
Iterative label refinement with dual networks
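The alternating exchange of pseudo labels between the two streams can be sketched as a simple loop. This is an illustration under strong assumptions, not the paper's PTLR procedure: both CASs are assumed to be already aligned to the same frame indices, and network training is replaced by a hypothetical stand-in (`nudge_toward`) that moves scores toward the labels supplied by the other stream. All names, thresholds, and numbers are made up.

```python
from typing import Dict, List, Tuple

CAS = List[List[float]]   # cas[t][c]: score of class c at frame t
Labels = Dict[int, int]   # frame index -> class index


def confident_labels(cas: CAS, threshold: float) -> Labels:
    """Keep frames whose top class score reaches the confidence threshold."""
    labels = {}
    for t, scores in enumerate(cas):
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            labels[t] = best
    return labels


def nudge_toward(cas: CAS, labels: Labels, step: float) -> CAS:
    """Stand-in for training: move labelled frames toward their one-hot label."""
    updated = [list(row) for row in cas]  # copy, so the input is not modified
    for t, cls in labels.items():
        for c in range(len(updated[t])):
            target = 1.0 if c == cls else 0.0
            updated[t][c] += step * (target - updated[t][c])
    return updated


def refine_in_turn(cas_ots: CAS, cas_rts: CAS, rounds: int = 3,
                   threshold: float = 0.6, step: float = 0.5) -> Tuple[CAS, CAS]:
    """Alternate: each stream is updated with pseudo labels from the other."""
    for _ in range(rounds):
        cas_ots = nudge_toward(cas_ots, confident_labels(cas_rts, threshold), step)
        cas_rts = nudge_toward(cas_rts, confident_labels(cas_ots, threshold), step)
    return cas_ots, cas_rts


# Toy example: the OTS stream is unsure about frame 1, the RTS stream is not.
cas_ots = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
cas_rts = [[0.7, 0.3], [0.7, 0.3], [0.35, 0.65]]
new_ots, new_rts = refine_in_turn(cas_ots, cas_rts)
```

In the toy run, frame 1 starts below the confidence threshold in the OTS stream and becomes confidently labelled after the exchange, because the RTS stream supplies a label for it.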
Rui Su
University of Sydney
Action Detection, Visual Grounding
Dong Xu
School of Electrical and Information Engineering, The University of Sydney, NSW, Australia
Luping Zhou
School of Electrical and Computer Engineering, University of Sydney
Medical Imaging, Computer Vision, Machine Learning
Wanli Ouyang
School of Electrical and Information Engineering, The University of Sydney, NSW, Australia