🤖 AI Summary
This work addresses the limited robustness of frame-level embeddings and the difficulty of temporal alignment in video analysis. We propose Local-Alignment Contrastive (LAC), a self-supervised learning framework that integrates differentiable Smith–Waterman local alignment with affine gap penalties into contrastive learning, enabling soft alignment of video subsequences. Coupled with a Transformer encoder, LAC is trainable end to end for frame-level representation learning. Its core contributions are: (1) a differentiable local alignment loss that rewards the best-scoring subsequence alignment rather than enforcing a global temporal ordering; and (2) gap penalty parameters learned during training, which let the model adapt how temporal gaps are penalized. On action recognition tasks, the learned representations outperform existing state-of-the-art approaches.
📝 Abstract
Robust frame-wise embeddings are essential for video analysis and understanding tasks. We present a self-supervised method for representation learning based on aligning temporal video sequences. Our framework uses a transformer-based encoder to extract frame-level features and leverages them to find the optimal alignment path between video sequences. We introduce the novel Local-Alignment Contrastive (LAC) loss, which combines a differentiable local alignment loss, to capture local temporal dependencies, with a contrastive loss, to enhance discriminative learning. Prior work on video alignment has focused on global temporal ordering across sequence pairs, whereas our loss encourages identifying the best-scoring subsequence alignment. LAC uses the differentiable Smith-Waterman (SW) method with affine gap penalties, whose parameters are learned during training, enabling the model to adjust dynamically how temporal gaps of different lengths are penalized. Evaluations show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
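
To make the alignment component concrete, the following is a minimal NumPy sketch of a smoothed Smith-Waterman score with affine gaps. It is not the authors' implementation. It assumes a precomputed frame-to-frame similarity matrix `sim`, replaces each hard max in the standard three-state affine recursion with a temperature-scaled log-sum-exp, and uses fixed scalars `gap_open` and `gap_extend` where the paper learns its gap parameters. The function names and default values are hypothetical.

```python
import numpy as np

NEG = -1e9  # stands in for minus infinity in unreachable cells


def smooth_max(values, temperature):
    """Temperature-scaled log-sum-exp, a differentiable surrogate for max."""
    v = np.asarray(values, dtype=float) / temperature
    m = v.max()
    return temperature * (m + np.log(np.exp(v - m).sum()))


def soft_sw_affine(sim, gap_open=1.0, gap_extend=0.5, temperature=1.0):
    """Smoothed local alignment score for an (n, m) similarity matrix.

    M  : alignments ending with frame i matched to frame j
    Ix : alignments ending with a gap that skips a frame of the first video
    Iy : alignments ending with a gap that skips a frame of the second video
    """
    n, m = sim.shape
    M = np.full((n + 1, m + 1), NEG)
    Ix = np.full((n + 1, m + 1), NEG)
    Iy = np.full((n + 1, m + 1), NEG)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local alignment: either start fresh (0) or continue any state.
            prev = smooth_max(
                [0.0, M[i - 1, j - 1], Ix[i - 1, j - 1], Iy[i - 1, j - 1]],
                temperature,
            )
            M[i, j] = sim[i - 1, j - 1] + prev
            # Affine gaps: opening costs gap_open, each extension gap_extend.
            Ix[i, j] = smooth_max(
                [M[i - 1, j] - gap_open, Ix[i - 1, j] - gap_extend], temperature
            )
            Iy[i, j] = smooth_max(
                [M[i, j - 1] - gap_open, Iy[i, j - 1] - gap_extend], temperature
            )
    # The best local alignment may end at any matched pair of frames.
    return smooth_max(M[1:, 1:].ravel(), temperature)
```

As the temperature approaches zero, the score approaches the classical hard-max Smith-Waterman score. At higher temperatures every cell contributes to the result, which is what makes the score usable as a training signal. In a training setup, `sim` would typically come from encoder embeddings (for example, cosine similarities between frames of two videos), and the gap parameters would be trainable variables in an autodiff framework rather than constants.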