Joint Self-Supervised Video Alignment and Action Segmentation

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the joint modeling challenge of unsupervised video alignment and action segmentation. We propose the first unified self-supervised learning framework for both tasks. Methodologically, we formulate them jointly as a single optimal transport problem, introducing a structural-prior-augmented Gromov–Wasserstein (GW) metric to simultaneously optimize alignment and segmentation within a single model—without any manual annotations. Our core contributions are: (1) a multi-task unified objective function based on a structure-aware GW distance; and (2) a differentiable self-supervised optimization mechanism enabling end-to-end joint training. On multiple benchmarks, our method achieves state-of-the-art performance in video alignment and significantly outperforms existing unsupervised approaches in action segmentation mAP. Moreover, it reduces GPU memory consumption and storage overhead by approximately 50%.

Technology Category

Application Category

📝 Abstract
We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.
Problem

Research questions and friction points this paper is trying to address.

Self-supervised video alignment using optimal transport
Joint video alignment and action segmentation unification
Efficient GPU training with few optimal transport iterations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fused Gromov-Wasserstein optimal transport formulation
Unified framework for joint alignment and segmentation
Single-model solution saving time and memory
🔎 Similar Papers
No similar papers found.
A
Ali Shah
Retrocausal, Inc.
A
Ali † Syed
Retrocausal, Inc.
M
Mubin Saeed
Retrocausal, Inc.
Andrey Konin
Andrey Konin
Chief Architect, Retrocausal, Inc.
Computer visionmachine learning
M
M. Zeeshan
Retrocausal, Inc.
Z
Zia Quoc-Huy
Retrocausal, Inc.
Quoc-Huy Tran
Quoc-Huy Tran
Retrocausal, Inc.
Video UnderstandingAction Recognition3D PerceptionAutonomous Driving