CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Unsupervised Temporal Action Localization (UTAL) faces two key challenges: (1) classification-pretrained visual features tend to overemphasize local discriminative regions, hindering global temporal structure learning; and (2) relying solely on visual modality limits precise action boundary modeling. To address these, this work introduces CLIP’s vision-language priors into UTAL for the first time, proposing a CLIP-guided cross-modal collaborative enhancement framework that jointly leverages visual, linguistic, and audio semantics—fully eliminating dependence on temporal annotations or class labels. Methodologically, we design a self-supervised cross-view contrastive learning paradigm, where audio-visual feature alignment and CLIP-driven multimodal co-optimization synergistically enhance contextual boundary discrimination. Extensive experiments demonstrate state-of-the-art performance on two standard benchmarks, validating the effectiveness and robustness of cross-modal priors for unsupervised action boundary modeling.
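The cross-view contrastive idea described above — pulling temporally aligned audio and visual features together while pushing mismatched pairs apart — can be illustrated with a minimal numpy sketch of a symmetric InfoNCE-style objective. This is an assumption-laden illustration, not the paper's actual loss: the function name `info_nce`, the temperature value, and the use of plain numpy arrays are all hypothetical stand-ins for the framework's real audio-visual alignment module.

```python
import numpy as np

def info_nce(audio, visual, tau=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    audio, visual: (N, D) arrays of paired embeddings; row i of each
    comes from the same temporal segment, so (i, i) are positives and
    all other pairings in the batch are negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / tau  # (N, N) similarity matrix

    def cross_entropy_diag(l):
        # log-softmax over each row, with max-subtraction for stability;
        # the target for row i is column i (the matching pair)
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # average the audio-to-visual and visual-to-audio directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Under this sketch, a batch whose audio and visual embeddings are correctly paired yields a lower loss than one where the pairing is scrambled, which is the training signal that sharpens contextual boundary discrimination without any temporal annotations.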

📝 Abstract
Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods rely heavily on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to obtain. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) classification-pretrained features focus excessively on highly discriminative regions; 2) relying solely on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audio-visual enhanced UTAL method. Specifically, we introduce vision-language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.
Problem

Research questions and friction points this paper is trying to address.

Unsupervised temporal action localization lacks labeled data
Existing methods focus too narrowly on discriminative regions
Visual-only approaches miss contextual audio boundary cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-assisted cross-view audiovisual enhancement
Vision-language and classification pre-training collaborative enhancement
Self-supervised cross-view learning paradigm
Rui Xia
Shenzhen International Graduate School, Tsinghua University, Beijing, China
Dan Jiang
Tsinghua University
CV, LLM
Quan Zhang
Shenzhen International Graduate School, Tsinghua University, Beijing, China
Ke Zhang
Shenzhen International Graduate School, Tsinghua University, Beijing, China
Chun Yuan
Shenzhen International Graduate School, Tsinghua University, Beijing, China