🤖 AI Summary
This work addresses open-world multimodal first-person activity recognition, where existing methods struggle to detect novel activities and suffer from catastrophic forgetting due to overreliance on the RGB modality. To overcome these limitations, the authors propose the MAND framework, which introduces Modality-aware Adaptive Scoring (MoAS) at inference to improve novel activity detection by fusing energy scores across modalities. During training, MAND employs Modality-wise Representation Stabilization Training (MoRST), which combines auxiliary heads with modality-wise logit distillation to preserve stable representations for each modality. Experiments on public benchmarks show that MAND achieves up to a 10% improvement in AUC for novel activity detection and a 2.8% gain in accuracy on known classes, significantly outperforming current continual learning approaches.
📄 Abstract
Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
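The abstract does not give MoAS in equations, but the described mechanism, per-sample reliability from energy scores followed by adaptive fusion of modality logits, can be sketched minimally. The sketch below is an illustrative interpretation, not the authors' implementation: the function names (`energy_score`, `moas_fuse`), the two-modality setup (RGB and IMU), and the softmax-over-negated-energy weighting are all assumptions; only the use of the standard free-energy score and the idea of reliability-weighted logit fusion come from the text.

```python
import numpy as np

def energy_score(logits, T=1.0):
    # Standard free-energy score E(x) = -T * logsumexp(logits / T).
    # Lower energy indicates a more confident, "known"-like prediction;
    # higher energy suggests a novel activity.
    z = logits / T
    m = z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    return -T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def moas_fuse(rgb_logits, imu_logits, T=1.0):
    # Hypothetical MoAS-style fusion: derive per-sample reliability weights
    # from the negated modality energies (softmax), then take a convex
    # combination of the modality logits so the more confident modality
    # dominates the fused novelty score.
    e = np.stack([energy_score(rgb_logits, T),
                  energy_score(imu_logits, T)], axis=-1)   # (batch, 2)
    neg = -e - (-e).max(axis=-1, keepdims=True)            # stable softmax
    w = np.exp(neg) / np.exp(neg).sum(axis=-1, keepdims=True)
    fused = w[..., :1] * rgb_logits + w[..., 1:] * imu_logits
    return fused, energy_score(fused, T)

# Toy usage: 4 samples, 10 known classes per modality head.
rgb = np.random.randn(4, 10)
imu = np.random.randn(4, 10)
fused_logits, novelty = moas_fuse(rgb, imu)
```

A sample would then be flagged as novel when `novelty` exceeds a threshold calibrated on known-class data; the weighting lets a reliable IMU head override a degraded RGB head on a per-sample basis, which is the imbalance the abstract argues plain main-logit scoring cannot correct.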