🤖 AI Summary
Addressing the practical challenges of multi-view action recognition, namely partial sensor overlap, modality-limited inputs, and the availability of only sequence-level annotations, this paper proposes a view-aware cross-modal knowledge distillation framework. Methodologically: (1) a view-aware consistency module aligns cross-view prediction distributions using human-detection masks and a confidence-weighted Jensen–Shannon divergence; (2) a cross-modal adapter models inter-modal dependencies via cross-attention; (3) a sparse-label distillation strategy, grounded in predictive-distribution consistency, enables robust learning under incomplete modalities and sequence-level supervision. Evaluated on the real-world MultiSensor-Home dataset, the approach significantly outperforms existing distillation methods, achieving state-of-the-art performance across diverse backbone architectures and resource-constrained settings, and surpassing even the teacher model in several configurations.
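The cross-modal adapter described in point (2) can be pictured as a small attention module in which the student's available modality queries features of another modality. The paper does not publish code here, so the class below is only a minimal sketch under assumed shapes; the name `CrossModalAdapter`, the residual-plus-LayerNorm fusion, and the head count are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Illustrative sketch (not the paper's code): features of one
    modality attend over features of another modality."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (B, T, D) features of the available modality (queries)
        # context: (B, S, D) features of the other modality (keys/values)
        attended, _ = self.attn(x, context, context)
        # Residual fusion: keep the original stream, add cross-modal evidence
        return self.norm(x + attended)

# Example usage with assumed tensor shapes
adapter = CrossModalAdapter(dim=16)
fused = adapter(torch.randn(2, 5, 16), torch.randn(2, 7, 16))
```

When a modality is missing at inference time, a design like this degrades gracefully: the residual path preserves the single-modality features even if the attended context is absent or zeroed.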
📝 Abstract
The widespread use of multi-sensor systems has spurred research on multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings, where actions are visible in only a subset of views, remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently, or only partially, across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and a confidence-weighted Jensen–Shannon divergence between the views' predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
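The View-aware Consistency idea, masking out non-co-visible samples and weighting a Jensen–Shannon divergence between per-view class distributions by prediction confidence, can be sketched as a loss function. This is an assumed formulation, not the paper's published code: the exact confidence weighting (here, the product of each view's maximum class probability) and the normalization are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def view_consistency_loss(logits_a: torch.Tensor,
                          logits_b: torch.Tensor,
                          covisible: torch.Tensor) -> torch.Tensor:
    """Sketch of a confidence-weighted JS-divergence consistency loss.

    logits_a, logits_b: (N, C) class logits for the same samples in two views.
    covisible: (N,) boolean mask, True where human detection finds the
               action visible in both views (assumed mask semantics).
    """
    p = F.softmax(logits_a, dim=-1)
    q = F.softmax(logits_b, dim=-1)
    m = 0.5 * (p + q)
    log_m = m.clamp_min(1e-8).log()
    # JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), per sample
    kl_pm = (p * (p.clamp_min(1e-8).log() - log_m)).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-8).log() - log_m)).sum(-1)
    jsd = 0.5 * (kl_pm + kl_qm)
    # Assumed confidence weight: product of each view's max class probability
    conf = p.max(dim=-1).values * q.max(dim=-1).values
    w = conf * covisible.float()
    return (w * jsd).sum() / w.sum().clamp_min(1e-8)
```

Two properties make this a reasonable consistency objective: the loss is exactly zero when the views agree, and it is bounded above by ln 2 per sample, so confident but disagreeing view pairs dominate the gradient while non-co-visible pairs contribute nothing.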