🤖 AI Summary
To address the degradation of action representations in in-the-wild videos caused by extreme viewpoint variation and severe visual occlusion, this paper proposes a geometry-aware framework that combines fine-grained viewpoint-occlusion ranking with progressive knowledge distillation. It introduces the first viewpoint-occlusion quantification metric grounded in 3D geometric priors, and integrates it with action-semantic consistency constraints to build a curriculum-based distillation mechanism that learns adaptively from easy to hard viewpoint pairs. The method jointly leverages geometry-driven ranking, curriculum learning, knowledge distillation, and temporal action modeling, overcoming conventional multi-view approaches' reliance on low-occlusion scenarios. The framework achieves state-of-the-art performance on both temporal key-step localization and fine-grained key-step recognition, with particularly large gains under heavily occluded viewpoints.
📝 Abstract
Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe viewpoint occlusions. We first define a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Then, using those rankings, we formulate a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks, particularly on views that exhibit severe occlusion.
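To make the easy-to-hard pairing idea concrete, here is a minimal sketch of how occlusion-ranked views could be scheduled into distillation pairs over training. The function name, the rank-gap schedule, and the toy scores are all hypothetical illustrations, not the paper's actual procedure: the idea is simply that a low-occlusion "teacher" view is paired with a more-occluded "student" view, and the allowed occlusion gap between them grows with the epoch.

```python
import numpy as np

def curriculum_view_pairs(occlusion_scores, epoch, total_epochs):
    """Hypothetical sketch: pair low-occlusion 'teacher' views with
    more-occluded 'student' views, admitting larger occlusion-rank gaps
    as training progresses (easy -> hard curriculum).

    occlusion_scores: shape (num_views,), higher = more occluded.
    Returns (teacher_view, student_view) index pairs.
    """
    order = np.argsort(occlusion_scores)   # views ranked least -> most occluded
    n = len(order)
    # curriculum budget: fraction of the ranking span allowed at this epoch
    frac = (epoch + 1) / total_epochs
    allowed_gap = max(1, round(frac * (n - 1)))
    pairs = []
    for i in range(n):
        # only admit student views within the current rank-gap budget
        for j in range(i + 1, min(i + allowed_gap, n - 1) + 1):
            pairs.append((int(order[i]), int(order[j])))
    return pairs

# Toy per-view occlusion estimates: early epochs yield only adjacent ranks,
# later epochs also include extreme (least vs. most occluded) pairs.
scores = np.array([0.1, 0.4, 0.7, 0.95])
print(curriculum_view_pairs(scores, epoch=0, total_epochs=4))  # → [(0, 1), (1, 2), (2, 3)]
print(len(curriculum_view_pairs(scores, epoch=3, total_epochs=4)))  # → 6 (all pairs)
```

A distillation loss (e.g. matching the student view's embedding to the teacher view's) would then be applied over the pairs returned for the current epoch.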