🤖 AI Summary
In egocentric human activity recognition (HAR), inertial measurement unit (IMU) modalities suffer from scarce labeled data and poor generalization, while video modalities incur high power consumption and privacy risks. To address this trade-off, we propose an unsupervised cross-modal knowledge distillation framework. Our method transfers semantic knowledge from a frozen vision transformer (ViT) encoder to an IMU-based temporal encoder (TCN or Transformer) via a self-supervised video-to-IMU distillation scheme that requires no labels. Crucially, we introduce a dynamically updated instance queue to align feature distributions across modalities. Evaluated on multiple egocentric HAR benchmarks, our approach significantly improves classification accuracy, matching the performance of fully supervised fine-tuning baselines while demonstrating superior cross-dataset generalization.
📝 Abstract
Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they lack large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at https://github.com/Breezelled/COMODO.
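To make the queue-based distillation idea concrete, here is a minimal NumPy sketch of the general recipe the abstract describes: a FIFO queue of embeddings produced by the frozen video (teacher) encoder serves as a shared set of anchor instances, and the IMU (student) encoder is trained to match the teacher's similarity distribution over those anchors, so no labels are needed. This is an illustrative sketch, not the paper's implementation; the class and function names (`InstanceQueue`, `distillation_loss`), the temperatures, and the queue size are all assumptions for demonstration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, tau):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class InstanceQueue:
    """FIFO queue of teacher embeddings acting as a shared anchor set
    (hypothetical helper; queue size and init are illustrative)."""
    def __init__(self, dim, size, seed=0):
        rng = np.random.default_rng(seed)
        self.buf = l2_normalize(rng.standard_normal((size, dim)))
        self.ptr = 0

    def enqueue(self, z):
        # Overwrite the oldest entries with the newest teacher embeddings.
        n = len(z)
        idx = (self.ptr + np.arange(n)) % len(self.buf)
        self.buf[idx] = l2_normalize(z)
        self.ptr = (self.ptr + n) % len(self.buf)

def distillation_loss(student_z, teacher_z, queue, tau_s=0.1, tau_t=0.05):
    """Cross-entropy between the teacher's and the student's similarity
    distributions over the queued instances (label-free distillation target)."""
    s = l2_normalize(student_z)
    t = l2_normalize(teacher_z)
    p_t = softmax(t @ queue.buf.T, tau_t)  # teacher's target distribution
    p_s = softmax(s @ queue.buf.T, tau_s)  # student's predicted distribution
    return float(-(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean())

# Toy usage: in practice teacher_z comes from the frozen video encoder and
# student_z from the trainable IMU encoder on the synchronized clip/window.
rng = np.random.default_rng(1)
queue = InstanceQueue(dim=8, size=16)
teacher_z = rng.standard_normal((4, 8))
student_z = rng.standard_normal((4, 8))
loss = distillation_loss(student_z, teacher_z, queue)
queue.enqueue(teacher_z)  # teacher embeddings refresh the queue each step
```

In a real training loop the loss would be minimized by gradient descent on the student only, with the teacher kept frozen, which is what lets the IMU encoder inherit video semantics while staying cheap at inference time.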