🤖 AI Summary
To address the degradation of action representations in in-the-wild videos caused by extreme viewpoint variation and severe visual occlusion, this paper proposes a geometry-aware framework that combines fine-grained viewpoint-occlusion ranking with progressive knowledge distillation. It introduces the first viewpoint-occlusion quantification metric grounded in 3D geometric priors, and integrates it with action-semantic consistency constraints to build a curriculum-based distillation mechanism that learns adaptively from easy to hard viewpoint pairs. The method jointly leverages geometry-driven ranking, curriculum learning, knowledge distillation, and temporal action modeling, overcoming conventional multi-view approaches' reliance on low-occlusion scenarios. The framework achieves state-of-the-art performance on both temporal key-step localization and fine-grained key-step recognition, with particularly large gains under heavily occluded viewpoints.
📝 Abstract
Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe viewpoint occlusions. We first define a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Then, using those rankings, we formulate a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks, particularly on views that exhibit severe occlusion.
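To make the easy-to-hard pairing idea concrete, here is a minimal sketch of how occlusion-ranked views could be scheduled into distillation pairs over training. The function name, the rank-gap schedule, and the toy scores are all hypothetical illustrations, not the paper's actual procedure: the idea is simply that a low-occlusion "teacher" view is paired with a more-occluded "student" view, and the allowed occlusion gap between them grows with the epoch.

```python
import numpy as np

def curriculum_view_pairs(occlusion_scores, epoch, total_epochs):
    """Hypothetical sketch: pair low-occlusion 'teacher' views with
    more-occluded 'student' views, admitting larger occlusion-rank gaps
    as training progresses (easy -> hard curriculum).

    occlusion_scores: shape (num_views,), higher = more occluded.
    Returns (teacher_view, student_view) index pairs.
    """
    order = np.argsort(occlusion_scores)   # views ranked least -> most occluded
    n = len(order)
    # curriculum budget: fraction of the ranking span allowed at this epoch
    frac = (epoch + 1) / total_epochs
    allowed_gap = max(1, round(frac * (n - 1)))
    pairs = []
    for i in range(n):
        # only admit student views within the current rank-gap budget
        for j in range(i + 1, min(i + allowed_gap, n - 1) + 1):
            pairs.append((int(order[i]), int(order[j])))
    return pairs

# Toy per-view occlusion estimates: early epochs yield only adjacent ranks,
# later epochs also include extreme (least vs. most occluded) pairs.
scores = np.array([0.1, 0.4, 0.7, 0.95])
print(curriculum_view_pairs(scores, epoch=0, total_epochs=4))  # → [(0, 1), (1, 2), (2, 3)]
print(len(curriculum_view_pairs(scores, epoch=3, total_epochs=4)))  # → 6 (all pairs)
```

A distillation loss (e.g. matching the student view's embedding to the teacher view's) would then be applied over the pairs returned for the current epoch.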