🤖 AI Summary
Multimodal learning faces a fundamental challenge—“multiplicity”: the inherent many-to-many semantic relationships across modalities, which mainstream approaches oversimplify via deterministic one-to-one alignment assumptions, thereby neglecting semantic abstraction, representational asymmetry, and task-dependent ambiguity. This work formally introduces and systematizes the concept of multiplicity, demonstrating its pervasive impact across data construction, model training, and evaluation—where it induces training instability, unreliable assessment, and degraded data quality. Through conceptual analysis and causal attribution modeling, we propose a multiplicity-aware learning framework and a corresponding data paradigm that transcend conventional deterministic alignment. Our contributions include: (i) a rigorous theoretical characterization of multiplicity as a core property of multimodal semantics; (ii) principled methods to quantify and mitigate multiplicity-induced uncertainty; and (iii) a scalable research pathway grounded in this new theoretical foundation for multimodal learning.
📝 Abstract
Multimodal learning has seen remarkable progress, particularly with the emergence of large-scale pre-training across various modalities. However, most current approaches are built on the assumption of a deterministic, one-to-one alignment between modalities. This oversimplifies real-world multimodal relationships, which are inherently many-to-many. This phenomenon, termed multiplicity, is not a side effect of noise or annotation error, but an inevitable outcome of semantic abstraction, representational asymmetry, and task-dependent ambiguity in multimodal tasks. This position paper argues that multiplicity is a fundamental bottleneck that manifests across all stages of the multimodal learning pipeline, from data construction to training and evaluation. The paper examines the causes and consequences of multiplicity, and highlights how it introduces training uncertainty, unreliable evaluation, and degraded dataset quality. This position calls for new research directions in multimodal learning: novel multiplicity-aware learning frameworks and dataset construction protocols that account for multiplicity.