🤖 AI Summary
This work addresses the challenges of scarce labeled data and missing sensor signals commonly encountered in real-world wearable applications, where existing self-supervised methods often generate unreliable, hallucinated details under modality dropout. To overcome this, the authors propose VCR, a self-supervised framework that employs an orthogonal tokenizer to strictly disentangle multimodal signals into shared semantics and modality-specific residuals. A missingness-aware mixture-of-experts backbone is designed to reconstruct only the inferable shared components, thereby avoiding erroneous reconstruction of missing modality details. This approach significantly enhances representation robustness. Experiments demonstrate that VCR consistently outperforms both strong supervised and self-supervised baselines across complete, single-modality, and multi-modality missing settings, achieving notable improvements in performance and robustness on diverse health monitoring tasks.
📝 Abstract
Wearable devices enable continuous health monitoring from multimodal signals, but real-world deployment is hindered by limited labeled data and pervasive sensor incompleteness. While large-scale self-supervised pretraining reduces label dependence, most existing methods assume full modality availability. Current approaches for handling modality missingness often reconstruct entire absent signals, which can encourage hallucinating modality-specific details that are not inferable from the observed sensor signals and degrade robustness. We propose VCR, a self-supervised framework that learns to extract valid representations robust to modality missingness. VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details. Across multiple health monitoring tasks, VCR consistently improves performance and robustness under full, single-missing, and multiple-missing modality settings compared with strong supervised and self-supervised baselines.