How Far Are We from Predicting Missing Modalities with Foundation Models?

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal foundation models face two key bottlenecks in missing-modality prediction: insufficient fine-grained semantic extraction from the available modalities and a lack of robust validation of generated outputs. Through a systematic evaluation of 42 model variants, we identify pervasive deficiencies, including poor cross-modal generation consistency and weak calibration. To address these, we propose the first agent-based framework designed specifically for missing-modality prediction. It integrates modality-aware contextual feature mining, dynamic policy modeling, and a self-iterative refinement mechanism, enabling unsupervised generation and calibration through multi-round, feedback-driven internal validation. Experiments demonstrate substantial improvements: FID decreases by at least 14% for missing-image prediction and MER by at least 10% for missing-text prediction, significantly outperforming existing baselines. Our work establishes a new paradigm for robust cross-modal reasoning in multimodal foundation models.
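For context, FID (for images) and MER (for text) are lower-is-better metrics, so the reported gains are relative reductions against a baseline. A minimal illustration of that arithmetic, using made-up scores since this listing gives no absolute numbers:

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - ours) / baseline

# Illustrative numbers only; the actual FID/MER values are not given in this listing.
print(f"{relative_reduction(50.0, 43.0):.0%}")  # 14% -> a 14% FID reduction
```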

📝 Abstract
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality prediction remains underexplored. To investigate this, we categorize existing approaches into three representative paradigms, encompassing a total of 42 model variants, and conduct a comprehensive evaluation in terms of prediction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned predictions. To address these challenges, we propose an agentic framework tailored for missing modality prediction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10% compared to baselines.
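The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the described control flow, not the authors' implementation: a mining step that plans a context-dependent strategy, a generation step for the missing modality, and a feedback-driven self-refinement loop. All names below (Miner, Generator, Validator, Feedback, max_rounds, accept_score) are hypothetical placeholders.

```python
# Hypothetical sketch of the agentic pipeline from the abstract.
# The paper publishes no API; every name below is a placeholder.
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Feedback:
    score: float  # internal quality estimate, higher is better
    hints: str    # critique carried into the next refinement round

class Miner(Protocol):
    def plan(self, available: dict[str, Any]) -> str: ...
    def extract(self, available: dict[str, Any], strategy: str) -> Any: ...

class Generator(Protocol):
    def generate(self, features: Any) -> Any: ...
    def refine(self, candidate: Any, features: Any, hints: str) -> Any: ...

class Validator(Protocol):
    def assess(self, candidate: Any, features: Any) -> Feedback: ...

def predict_missing_modality(
    available: dict[str, Any],
    miner: Miner,
    generator: Generator,
    validator: Validator,
    max_rounds: int = 3,
    accept_score: float = 0.8,
) -> Any:
    """Generate a missing modality, then iteratively verify and refine it."""
    # 1. Modality-aware mining: pick a strategy from the input context and
    #    extract fine-grained semantic features from the available modalities.
    strategy = miner.plan(available)
    features = miner.extract(available, strategy)

    # 2. Initial generation of the missing modality.
    candidate = generator.generate(features)

    # 3. Self-refinement: multi-round internal validation, no external labels.
    for _ in range(max_rounds):
        feedback = validator.assess(candidate, features)
        if feedback.score >= accept_score:
            break  # candidate passes the internal check
        candidate = generator.refine(candidate, features, feedback.hints)
    return candidate
```

The loop mirrors the abstract's "iteratively verifies and enhances" description; the acceptance threshold and round cap are assumed stopping rules, since the paper does not specify when refinement terminates.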
Problem

Research questions and friction points this paper is trying to address.

Evaluating the prediction accuracy of foundation models on missing-modality tasks
Addressing fine-grained semantic extraction limitations in current models
Proposing an agentic framework for dynamic modality-aware feature mining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework for dynamic modality-aware mining
Self-refinement mechanism for iterative quality enhancement
Improved accuracy via richer semantic feature extraction