Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment

📅 2024-09-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited robustness of frame-level embeddings and the difficulty of temporal alignment in video analysis. It proposes the Local-Alignment Contrastive (LAC) loss, a self-supervised objective that combines a differentiable Smith-Waterman (SW) affine local alignment loss with a contrastive loss, so that training rewards the best-scoring subsequence alignment rather than a global temporal ordering across the whole sequence pair. A transformer-based encoder extracts the frame-level features, and the framework is trainable end to end. The two core ideas are: (1) a differentiable local alignment loss that captures local temporal dependencies; and (2) gap penalty parameters learned during training, which let the model adjust the temporal gap penalty dynamically. On action recognition tasks, the learned representations outperform existing state-of-the-art approaches.

📝 Abstract

Robust frame-wise embeddings are essential to perform video analysis and understanding tasks. We present a self-supervised method for representation learning based on aligning temporal video sequences. Our framework uses a transformer-based encoder to extract frame-level features and leverages them to find the optimal alignment path between video sequences. We introduce the novel Local-Alignment Contrastive (LAC) loss, which combines a differentiable local alignment loss to capture local temporal dependencies with a contrastive loss to enhance discriminative learning. Prior works on video alignment have focused on using global temporal ordering across sequence pairs, whereas our loss encourages identifying the best-scoring subsequence alignment. LAC uses the differentiable Smith-Waterman (SW) affine method, which features a flexible parameterization learned through the training phase, enabling the model to adjust the temporal gap penalty length dynamically. Evaluations show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
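To make the alignment step in the abstract concrete, here is a minimal NumPy sketch of a smoothed Smith-Waterman recursion with affine gaps. It is not the authors' implementation: the function names, the log-sum-exp smoothing, the `temperature` parameter, and the fixed `gap_open` and `gap_extend` values are all assumptions for illustration. In the paper the gap parameters are learned during training, which would require an automatic-differentiation framework rather than plain NumPy.

```python
import numpy as np

def soft_max(values, temperature):
    """Smooth maximum: temperature * logsumexp(values / temperature)."""
    v = np.asarray(values, dtype=float) / temperature
    m = v.max()
    return temperature * (m + np.log(np.exp(v - m).sum()))

def smooth_sw_affine(sim, gap_open=1.0, gap_extend=0.5, temperature=1.0):
    """Smoothed local alignment score for a frame-by-frame similarity matrix.

    sim[i, j] is the similarity of frame i in video A and frame j in video B.
    gap_open / gap_extend are fixed here (assumption); the paper learns them.
    """
    n, m = sim.shape
    neg = -1e9  # stands in for minus infinity
    match = np.full((n + 1, m + 1), neg)   # alignments ending in a match
    gap_a = np.full((n + 1, m + 1), neg)   # alignments ending in a skipped frame of A
    gap_b = np.full((n + 1, m + 1), neg)   # alignments ending in a skipped frame of B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The 0.0 option lets a local alignment start at any cell.
            match[i, j] = sim[i - 1, j - 1] + soft_max(
                [0.0, match[i - 1, j - 1], gap_a[i - 1, j - 1], gap_b[i - 1, j - 1]],
                temperature,
            )
            # Affine gaps: opening costs gap_open, each extension costs gap_extend.
            gap_a[i, j] = soft_max(
                [match[i - 1, j] - gap_open, gap_a[i - 1, j] - gap_extend], temperature
            )
            gap_b[i, j] = soft_max(
                [match[i, j - 1] - gap_open, gap_b[i, j - 1] - gap_extend], temperature
            )
    # Local alignment: the best subsequence may end at any cell.
    return soft_max(np.append(match[1:, 1:].ravel(), 0.0), temperature)

# Toy similarity matrix: +1 on the diagonal (matching frames), -1 elsewhere.
sim = np.where(np.eye(3, dtype=bool), 1.0, -1.0)
score_sharp = smooth_sw_affine(sim, temperature=0.01)  # close to the hard score of 3
score_smooth = smooth_sw_affine(sim, temperature=1.0)  # never below the hard score
```

As the temperature goes to zero the smooth maximum approaches the ordinary maximum, so the score approaches the classic Smith-Waterman value. Larger temperatures spread weight over many candidate alignments, which is what makes the score usable as a training signal.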

Problem

Research questions and friction points this paper is trying to address.

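Learning robust frame-wise video embeddings without labels, using temporal alignment between video sequences as the training signal.

Prior video alignment work relies on global temporal ordering across sequence pairs rather than on the best-scoring subsequence alignment.

Letting the model adjust the temporal gap penalty dynamically during training, with action recognition as the evaluation task.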

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised contrastive learning for video representation.

Transformer-based encoder for frame-level feature extraction.

Differentiable Smith-Waterman affine method for local alignment, with gap penalty parameters learned during training.
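
An alignment method of this kind needs a score for every pair of frames drawn from the two videos. The sketch below builds such a matrix from two sequences of frame embeddings. The use of cosine similarity, the function name `frame_similarity`, and the random vectors standing in for encoder outputs are assumptions for illustration; the summary does not state which similarity measure the paper uses.

```python
import numpy as np

def frame_similarity(emb_a, emb_b, eps=1e-8):
    """Cosine similarity between every frame of A and every frame of B.

    emb_a has shape (frames_a, dim) and emb_b has shape (frames_b, dim).
    Returns a matrix of shape (frames_a, frames_b).
    """
    a = emb_a / (np.linalg.norm(emb_a, axis=1, keepdims=True) + eps)
    b = emb_b / (np.linalg.norm(emb_b, axis=1, keepdims=True) + eps)
    return a @ b.T

# Random vectors stand in for transformer encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
video_a = rng.normal(size=(5, 16))
video_b = rng.normal(size=(7, 16))
sim = frame_similarity(video_a, video_b)
self_sim = frame_similarity(video_a, video_a)
```

Normalizing the embeddings keeps every entry in the range from -1 to 1, which makes gap penalties easier to interpret on a fixed scale.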
