🤖 AI Summary
This work addresses a key limitation of conventional multimodal representation learning: built on a shared-private dichotomy, it struggles to capture latent factors shared only among subsets of modalities, which often leads to excessive alignment of irrelevant signals and loss of complementary information. To overcome this, the authors propose a Hierarchical Contrastive Learning (HCL) framework that introduces, for the first time, a hierarchical latent-variable structure to explicitly model globally shared, partially shared, and modality-specific components. A structure-aware contrastive objective is designed to align only those factors that are genuinely shared. Theoretical analysis establishes identifiability of the model under the assumption of uncorrelated latent variables and provides recovery guarantees for the loading matrices along with bounds on parameter estimation and prediction risk. Experiments demonstrate that HCL accurately recovers the hierarchical structure, effectively selects task-relevant components, and significantly improves representation quality and downstream prediction performance on multimodal electronic health records.
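To make the "structure-aware contrastive objective" concrete, here is a minimal sketch of the idea: a standard InfoNCE term is applied to a latent block across a pair of modalities only when both modalities actually carry that block, instead of aligning every modality pair. All names (`info_nce`, `structure_aware_loss`, the `share` mask) and the plain-NumPy formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """Standard InfoNCE over a batch: row i of `anchor` is pulled toward
    row i of `positive` and pushed away from the other rows."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def structure_aware_loss(blocks, share):
    """blocks[m][k]: encoder output of modality m for latent block k.
    share[m][k]: True if modality m carries latent block k (hypothetical
    sharing mask). Only pairs that genuinely share block k are aligned."""
    total, n_terms = 0.0, 0
    n_mod, n_blk = len(share), len(share[0])
    for k in range(n_blk):
        for i in range(n_mod):
            for j in range(i + 1, n_mod):
                if share[i][k] and share[j][k]:        # genuinely shared only
                    total += info_nce(blocks[i][k], blocks[j][k])
                    n_terms += 1
    return total / max(n_terms, 1)
```

A plain shared-private contrastive loss corresponds to a mask where every modality carries one global block; the partial-sharing rows are what the hierarchical structure adds.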
📝 Abstract
Multimodal representation learning is commonly built on a shared-private decomposition, treating latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only subsets of modalities, and ignoring such partial sharing can over-align unrelated signals and obscure complementary information. We propose Hierarchical Contrastive Learning (HCL), a framework that learns globally shared, partially shared, and modality-specific representations within a unified model. HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only modalities that genuinely share a latent factor. Under the assumption of uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter estimation and excess-risk bounds for downstream prediction. Simulations show accurate recovery of the hierarchical structure and effective selection of task-relevant components. On multimodal electronic health records, HCL yields more informative representations and consistently improves predictive performance.
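The hierarchical latent-variable formulation can be sketched as a linear factor model in which each modality's observation mixes a globally shared factor, any partial factors its subset carries, and a modality-specific factor. All dimensions, the three-modality sharing pattern, and the loading-matrix names below are hypothetical choices for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not taken from the paper):
# n samples; global, partial, specific latent dims; observed dim per modality.
n, d_g, d_p, d_s, d_x = 500, 2, 2, 2, 10

# Three modalities: all carry the global factor z_g, modalities 0 and 1
# additionally share a partial factor z_p, and each has its own z_s[m].
z_g = rng.normal(size=(n, d_g))
z_p = rng.normal(size=(n, d_p))
z_s = [rng.normal(size=(n, d_s)) for _ in range(3)]

A_g = [rng.normal(size=(d_g, d_x)) for _ in range(3)]  # global loadings
A_p = [rng.normal(size=(d_p, d_x)) for _ in range(2)]  # partial loadings (0, 1)
A_s = [rng.normal(size=(d_s, d_x)) for _ in range(3)]  # specific loadings

X = []
for m in range(3):
    x_m = z_g @ A_g[m] + z_s[m] @ A_s[m]
    if m < 2:                         # only modalities 0 and 1 carry z_p
        x_m = x_m + z_p @ A_p[m]
    X.append(x_m + 0.1 * rng.normal(size=(n, d_x)))    # observation noise
```

Under this structure, a shared-private model would force either z_p into the global block (over-aligning modality 2) or into the specific blocks (losing the 0-1 correlation); the hierarchy keeps it as a distinct, partially shared component.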