🤖 AI Summary
This work addresses the degradation in human motion segmentation performance that occurs when real-world video frame features violate the Union of Subspaces (UoS) assumption. To tackle this issue, the authors propose Temporal Deep Self-expressive subspace Clustering (TDSC), which alternately optimizes temporally consistent structured representations and self-expressive coefficients through a regularized self-expressive model. The method incorporates a coding-rate maximization regularizer to prevent representation collapse and enforces temporal constraints so that adjacent frames receive consistent segment assignments. TDSC also uses a temporal momentum averaging mechanism to stabilize the evolution of the affinity matrix and a reparameterization strategy to make optimization more efficient. Experiments on five benchmark datasets with HoG, CLIP, and DINOv2 features are reported to show that TDSC consistently outperforms existing methods, supporting its effectiveness and robustness.
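The summary does not spell out how the momentum averaging is computed. As a rough illustration only, the sketch below assumes it is an exponential moving average of a symmetrized affinity built from the self-expressive coefficients. The function names, the `momentum` value, and the choice of `(|C| + |C|^T) / 2` as the affinity are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def symmetric_affinity(C):
    """Build a symmetric, non-negative affinity from coefficients C (n x n).

    Assumption: the common subspace-clustering choice (|C| + |C|^T) / 2.
    """
    A = np.abs(C)
    return 0.5 * (A + A.T)

def momentum_update(A_avg, C, momentum=0.9):
    """Blend the running affinity with the affinity from the newest C.

    Assumption: a plain exponential moving average; the first call
    (A_avg is None) simply initializes the running average.
    """
    A_new = symmetric_affinity(C)
    if A_avg is None:
        return A_new
    return momentum * A_avg + (1.0 - momentum) * A_new

# Hypothetical usage: coefficients that fluctuate between iterations.
rng = np.random.default_rng(0)
A_avg = None
for _ in range(5):
    C = rng.normal(size=(6, 6))
    A_avg = momentum_update(A_avg, C)
```

With a momentum close to 1, a single noisy update to the coefficients changes the running affinity only slightly, which is the stabilizing effect the summary describes.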
📝 Abstract
Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based on subspace clustering, which is grounded in the assumption that the distribution of high-dimensional temporal features aligns well with a Union of Subspaces (UoS). For real-world videos, however, the raw frame-level features often violate the UoS assumption, which yields unsatisfactory segmentation performance. To address this issue, we propose an efficient and effective approach for HMS, named Temporal Deep Self-expressive subspace Clustering (TDSC), which jointly learns temporally consistent structured representations and a stabilized affinity for accurate and robust HMS. Specifically, TDSC alternately learns structured representations of the input frame features and self-expressive coefficients via a properly regularized self-expressive model. In this model, a coding-rate maximization regularizer prevents representation collapse and encourages the learned representations to follow the desired UoS distribution, while temporal constraints encourage temporally adjacent frames to be assigned to the same group. Moreover, we develop a temporal momentum averaging mechanism to stabilize the evolution of the affinity and design a reparameterization strategy to enable efficient optimization. We conduct extensive experiments on five benchmark HMS datasets using both conventional features (HoG) and recent deep features (CLIP and DINOv2) to validate the effectiveness of our approach.
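The abstract names three ingredients of the regularized self-expressive model without giving formulas. The sketch below shows one way they might be combined, under stated assumptions: the coding rate uses the standard form R(Z) = 1/2 logdet(I + d / (n eps^2) Z Z^T), the self-expressive term is a squared Frobenius residual with a zeroed diagonal, and the temporal constraint is a simple smoothness penalty on neighboring columns of C. The weights `gamma` and `lam`, the value of `eps`, and the exact form of the temporal term are illustrative choices, not the paper's.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate of Z (d x n): 1/2 * logdet(I + d / (n * eps^2) * Z Z^T)."""
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(gram)  # gram is positive definite
    return 0.5 * logdet

def self_expressive_residual(Z, C):
    """0.5 * ||Z - Z C||_F^2, with diag(C) zeroed so no frame explains itself."""
    C = C - np.diag(np.diag(C))
    return 0.5 * np.linalg.norm(Z - Z @ C, "fro") ** 2

def temporal_penalty(C):
    """Assumed temporal term: adjacent frames should have similar coefficients."""
    diffs = C[:, 1:] - C[:, :-1]
    return 0.5 * np.sum(diffs ** 2)

def objective(Z, C, gamma=1.0, lam=0.1, eps=0.5):
    """Illustrative loss: maximize coding rate, minimize the other two terms."""
    return (-coding_rate(Z, eps)
            + gamma * self_expressive_residual(Z, C)
            + lam * temporal_penalty(C))

# Collapsed features (all frames identical) versus well-spread features.
Z_collapsed = np.tile(np.array([[1.0], [0.0], [0.0], [0.0]]), (1, 4))
Z_spread = np.eye(4)
```

In this toy comparison, `coding_rate(Z_spread)` is larger than `coding_rate(Z_collapsed)`, which illustrates why maximizing the coding rate discourages all frames from mapping to the same point.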