Deep kernel video approximation for unsupervised action segmentation

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the task of unsupervised action segmentation from a single video by proposing an efficient method that avoids large-scale data storage. It models the distribution of video frames in a deep kernel space and leverages the Neural Tangent Kernel (NTK) to enhance representational capacity while circumventing trivial solutions in joint optimization. By employing Maximum Mean Discrepancy (MMD) to measure a geometry-preserving distance between the true and approximated distributions, the approach achieves precise segmentation. Compared to optimal transport–based strategies, the method is more amenable to optimization and computationally faster. It attains state-of-the-art performance among single-video unsupervised methods across six standard benchmarks and significantly outperforms existing clustering approaches in F1 score when the number of segments is unknown.
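For reference, the squared MMD mentioned above has a standard closed form in terms of a kernel. The notation here is an assumption made for illustration, not taken from the paper: $P$ stands for the distribution of video frames, $Q$ for its approximation, and $k$ for the kernel (an NTK in this work).

```latex
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{x, x' \sim P}\,[k(x, x')]
  - 2\,\mathbb{E}_{x \sim P,\; y \sim Q}\,[k(x, y)]
  + \mathbb{E}_{y, y' \sim Q}\,[k(y, y')]
```

Each term is an average of kernel evaluations, so a sample estimate needs only pairwise kernel matrices. This is one reason MMD tends to be cheaper to evaluate than an optimal transport distance, which requires solving for a coupling between the two sample sets.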

📝 Abstract

This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible or not permitted. We propose to segment videos by learning in a deep kernel space to approximate the underlying frame distribution as closely as possible. To define this closeness between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD), which is a geometry-preserving metric in distribution space and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize and faster. We use neural tangent kernels (NTKs) to define the kernel space where MMD operates, both because they have greater descriptive power than fixed kernels and because they sidestep the trivial solution when jointly learning the inputs (the video approximation) and the kernel function. Finally, we show competitive results compared with state-of-the-art per-video methods on six standard benchmarks. Additionally, our method achieves higher F1 scores than prior agglomerative work when the number of segments is unknown.
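As a rough illustration of how MMD can score a candidate video approximation, the sketch below computes a biased sample estimate of squared MMD between two sets of feature vectors. It is not the authors' implementation: a fixed Gaussian (RBF) kernel is swapped in for the learned NTK, and the names `rbf_kernel`, `mmd_squared`, the `bandwidth` value, and the synthetic "frame features" are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=4.0):
    """Gaussian kernel matrix between rows of a and rows of b.

    Assumption: a fixed RBF kernel stands in for the paper's NTK.
    """
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2ab.
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=4.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic stand-ins for per-frame features (200 frames, 16 dimensions).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))

# A small approximation drawn from the frames themselves, and one that is
# deliberately shifted away from the frame distribution.
close_approx = frames[::10]
far_approx = rng.normal(loc=3.0, size=(20, 16))

print("MMD^2 to subsampled frames:", mmd_squared(frames, close_approx))
print("MMD^2 to shifted samples:  ", mmd_squared(frames, far_approx))
```

The approximation drawn from the video's own frames should score a noticeably smaller value than the shifted one, which is the signal an optimizer would follow when fitting the approximation. With a fixed kernel like this one there is no kernel learning at all; the abstract's point about avoiding a trivial solution applies only when the kernel and the approximation are learned jointly.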

Problem

Research questions and friction points this paper is trying to address.

unsupervised action segmentation

per-video segmentation

video approximation

action recognition

temporal segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

deep kernel learning

maximum mean discrepancy

neural tangent kernels