ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multimodal data fusion and generation in ecological monitoring, where heterogeneous modalities (images, text, spectrograms, and time-series sensor signals) must be jointly modeled. The authors propose ProM3E, a probabilistic masked multimodal embedding model that learns modality-agnostic representations by randomly masking any subset of modalities and reconstructing their embeddings in a shared latent space, enabling bidirectional cross-modal generation and missing-modality inference. To add flexibility, they introduce a modality-feasibility analysis mechanism and a linear-probe-guided cross-modal similarity fusion strategy, supporting dynamic modality composition and hybrid retrieval. Experiments show that ProM3E outperforms state-of-the-art methods on downstream ecological tasks, including cross-modal retrieval, species identification, and environmental state prediction, demonstrating strong generalization and domain-specific adaptability.

📝 Abstract
We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released at https://vishu26.github.io/prom3e.
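The core mechanism described in the abstract, masking a subset of modality embeddings and probabilistically reconstructing them from the remaining context, can be sketched as below. This is a minimal illustrative toy, not the paper's architecture: the modality names, dimensions, the tanh "encoder" standing in for a learned network, and the Gaussian mean/log-variance heads are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                                             # embedding dimension (assumed)
MODALITIES = ["image", "text", "spectrogram", "sensor"]  # example modalities
M = len(MODALITIES)

# Toy parameters standing in for learned weights (random here, not trained).
mask_token = rng.normal(size=D) * 0.02            # learnable [MASK] embedding
modality_emb = rng.normal(size=(M, D)) * 0.02     # modality-type embeddings
W = rng.normal(size=(D, D)) / np.sqrt(D)          # shared "encoder" weight
W_mu = rng.normal(size=(D, D)) / np.sqrt(D)       # head predicting the mean
W_logvar = rng.normal(size=(D, D)) / np.sqrt(D)   # head predicting log-variance

def reconstruct(embs, missing):
    """embs: (M, D) per-modality embeddings; missing: (M,) bool, True = masked.
    Returns Gaussian parameters (mu, logvar) for every modality slot, each (M, D)."""
    # Substitute the [MASK] token for missing modalities, then tag each slot
    # with its modality-type embedding so the encoder knows which is which.
    x = np.where(missing[:, None], mask_token[None, :], embs) + modality_emb
    h = np.tanh(x @ W)                 # stand-in for a transformer encoder
    return h @ W_mu, h @ W_logvar

def masked_gaussian_nll(mu, logvar, target, missing):
    """Gaussian negative log-likelihood, averaged over masked slots only."""
    nll = 0.5 * (logvar + (target - mu) ** 2 / np.exp(logvar))
    return float(nll.mean(axis=-1)[missing].mean())

# Usage: mask the spectrogram and sensor modalities, infer them from context.
embs = rng.normal(size=(M, D))
missing = np.array([False, False, True, True])
mu, logvar = reconstruct(embs, missing)
loss = masked_gaussian_nll(mu, logvar, embs, missing)
```

Predicting a log-variance alongside the mean is what makes the reconstruction probabilistic: slots the model is uncertain about can widen their variance instead of being forced to a point estimate, which is the property the paper exploits to analyse which modality fusions are feasible.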
Problem

Research questions and friction points this paper is trying to address.

Generating multimodal representations for ecology when some modalities are missing
Analyzing the feasibility of fusing various modalities for downstream ecological tasks
Developing a cross-modal retrieval approach that mixes inter-modal and intra-modal similarities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic masked multimodal embedding for ecology
Learns to infer missing modalities from context
Supports modality inversion in embedding space
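The retrieval idea above, scoring a query against a gallery with a mix of same-modality and cross-modality similarities, can be sketched as a simple weighted combination of cosine scores. The function name, the single mixing weight `alpha`, and the fixed weighting scheme are illustrative assumptions; the paper's fusion is guided by linear probing rather than a hand-set weight.

```python
import numpy as np

def l2_normalize(X):
    # Normalize rows so that dot products become cosine similarities.
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def hybrid_scores(query, gallery_same, gallery_cross, alpha=0.5):
    """Mix intra-modal (same-modality) and inter-modal (cross-modality)
    cosine similarities for one query; alpha weights the intra-modal term."""
    q = query / np.linalg.norm(query)
    s_intra = l2_normalize(gallery_same) @ q    # e.g. image query vs. gallery images
    s_inter = l2_normalize(gallery_cross) @ q   # e.g. image query vs. gallery texts
    return alpha * s_intra + (1 - alpha) * s_inter

# Usage: rank a gallery of 5 items for an image query, using both the
# gallery's image embeddings and its text embeddings (toy random data).
rng = np.random.default_rng(1)
q_img = rng.normal(size=16)
g_img = rng.normal(size=(5, 16))   # gallery image embeddings (intra-modal)
g_txt = rng.normal(size=(5, 16))   # gallery text embeddings (inter-modal)
ranking = np.argsort(-hybrid_scores(q_img, g_img, g_txt, alpha=0.6))
```

Because both terms are cosine similarities in a shared embedding space, the mixed score stays in [-1, 1] and the two signals remain directly comparable without rescaling.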