🤖 AI Summary
In unsupervised action segmentation, frame- and segment-level representations suffer from the absence of segment-level supervision and from weak learning feedback. To address this, we propose the Closed Loop Optimal Transport (CLOT) framework. CLOT employs an encoder-decoder architecture to jointly learn frame embeddings, segment embeddings, and pseudo-labels. It introduces a three-stage optimal transport (OT) scheme coupled with cross-level attention, enabling bidirectional closed-loop optimization between frame- and segment-level representations. Furthermore, a hierarchical OT formulation integrates segment-level self-supervision, enhancing temporal consistency and representation discriminability. Extensive experiments on four benchmark datasets demonstrate that CLOT significantly outperforms state-of-the-art methods, including ASOT, validating the effectiveness and generalizability of its iterative feature learning mechanism for unsupervised action segmentation.
📝 Abstract
Unsupervised action segmentation has recently been advanced by ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions about the action ordering, and it can decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework that introduces a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
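
To make the idea of decoding pseudo-labels from a frame-to-action cost matrix more concrete, here is a minimal sketch. It is not the formulation used by ASOT or CLOT: it swaps in plain balanced entropic OT solved with Sinkhorn iterations, assumes uniform marginals, and does nothing to enforce temporal consistency. The function name `sinkhorn_pseudo_labels`, the cosine-distance cost, the random embeddings, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def sinkhorn_pseudo_labels(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between T frames and K actions (illustrative only).

    Assumes uniform marginals on both sides, i.e. a balanced problem.
    """
    T, K = cost.shape
    a = np.full(T, 1.0 / T)  # mass per frame
    b = np.full(K, 1.0 / K)  # mass per action (balanced assumption)
    gibbs = np.exp(-cost / eps)  # Gibbs kernel of the cost matrix
    u = np.ones(T)
    v = np.ones(K)
    for _ in range(n_iters):
        u = a / (gibbs @ v)    # rescale rows toward marginal a
        v = b / (gibbs.T @ u)  # rescale columns toward marginal b
    return u[:, None] * gibbs * v[None, :]


# Hypothetical toy data: 12 frame embeddings and 3 action embeddings.
rng = np.random.default_rng(0)
T, K, D = 12, 3, 8
frames = rng.normal(size=(T, D))
actions = rng.normal(size=(K, D))

# Cosine-distance cost between every frame and every action.
f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
c = actions / np.linalg.norm(actions, axis=1, keepdims=True)
cost = 1.0 - f @ c.T

plan = sinkhorn_pseudo_labels(cost)
pseudo_labels = plan.argmax(axis=1)  # hard label per frame
```

Each row of `plan` can be read as a soft assignment of one frame to the actions, and taking the row-wise argmax gives hard pseudo-labels. In a closed-loop scheme of the kind the abstract describes, such labels would feed back into representation learning, but that loop, the segment-level OT problems, and the cross-attention step are not shown here.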