AI Summary
This work addresses flexible multimodal, multiview human activity recognition, where the available views, their number, and the combination of modalities may all vary. The authors propose RALIS, a model that combines multiview contrastive learning with a Mixture-of-Experts (MoE) architecture to support arbitrary view configurations during both training and inference. Rather than reconstructing missing views, RALIS uses an adjusted center (centroid) contrastive loss for self-supervised representation learning and cross-view alignment. This loss formulation accommodates view weights that account for view quality and reduces computational complexity from O(V²) to O(V), where V is the number of views. An MoE module with a specialized load-balancing strategy handles residual discrepancies that contrastive learning does not capture. Evaluated on four datasets covering inertial and human pose modalities with 3–9 views, RALIS demonstrates its performance and flexibility.
Abstract
Multimodal multiview learning seeks to integrate information from diverse sources to enhance task performance. Existing approaches often struggle with flexible view configurations, including arbitrary view combinations, numbers of views, and heterogeneous modalities. Focusing on the context of human activity recognition, we propose RALIS, a model that combines multiview contrastive learning with a mixture-of-experts module to support arbitrary view availability during both training and inference. Instead of trying to reconstruct missing views, an adjusted center contrastive loss is used for self-supervised representation learning and view alignment, mitigating the impact of missing views on multiview fusion. This loss formulation allows for the integration of view weights to account for view quality. Additionally, it reduces computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. To address residual discrepancies not captured by contrastive learning, we employ a mixture-of-experts module with a specialized load balancing strategy, tasked with adapting to arbitrary view combinations. We highlight the geometric relationship among components in our model and how they combine well in the latent space. RALIS is validated on four datasets encompassing inertial and human pose modalities, with the number of views ranging from three to nine, demonstrating its performance and flexibility.
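To make the complexity claim concrete, the following is a minimal NumPy sketch of a generic centroid-based contrastive loss. It is not the adjusted loss used by RALIS, whose exact form the abstract does not give. It only illustrates the general idea: each available view is compared with one per-sample center, giving $V$ comparisons per sample where pairwise view-to-view alignment needs $V(V-1)/2$. The function name `center_contrastive_loss`, the `mask` and `weights` arguments, and the temperature value are assumptions made for illustration.

```python
import numpy as np


def center_contrastive_loss(z, mask, weights=None, temperature=0.1):
    """Illustrative centroid-based contrastive loss (not the paper's exact loss).

    z:       (B, V, D) view embeddings for B samples and V views.
    mask:    (B, V) array, 1/True where a view is available.
    weights: optional (V,) or (B, V) non-negative view weights (hypothetical).
    """
    eps = 1e-8
    batch = z.shape[0]

    # Missing views get weight zero, so they affect neither center nor loss.
    w = np.asarray(mask, dtype=z.dtype)
    if weights is not None:
        w = w * np.asarray(weights, dtype=z.dtype)

    # Unit-normalize embeddings so dot products are cosine similarities.
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

    # Weighted centroid of the available views of each sample: (B, D).
    center = (w[..., None] * z).sum(axis=1)
    center = center / np.maximum(w.sum(axis=1, keepdims=True), eps)
    center = center / (np.linalg.norm(center, axis=-1, keepdims=True) + eps)

    # Similarity of every view embedding to every sample's center: (B, V, B).
    # Each sample contributes V view-to-center comparisons, not V^2 view pairs.
    logits = np.einsum("ivd,jd->ivj", z, center) / temperature

    # Numerically stable log-softmax over the candidate centers.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # The positive for view (i, v) is the center of its own sample i.
    idx = np.arange(batch)
    positive = log_prob[idx, :, idx]  # shape (B, V)

    # Weighted average of the negative log-likelihood over available views.
    return float(-(w * positive).sum() / np.maximum(w.sum(), eps))
```

In this sketch a missing view is handled purely through its zero weight, so no reconstruction step is needed, and per-view quality weights enter the same way. How RALIS actually derives its view weights and combines the loss with the mixture-of-experts module is not specified in the abstract.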