🤖 AI Summary
Existing language-audio joint representation learning predominantly relies on deterministic embeddings, implicitly assuming a one-to-one semantic mapping, yet real-world language-audio semantics are inherently many-to-many. This work proposes ProLAP, to the authors' knowledge the first probabilistic language-audio pre-training framework, which uses distributional embeddings to explicitly model this semantic multiplicity. It introduces a hierarchical inclusion loss to capture semantic hierarchy and a mask repulsive loss to make optimizing the inclusion loss more efficient, enabling effective learning even from small datasets. Extensive experiments show significant improvements over deterministic baselines on audio-text retrieval. Moreover, a newly designed audio traversal task empirically validates that the learned representations encode a plausible semantic hierarchy, supporting structured, interpretable traversal between coarse and fine-grained concepts.
📝 Abstract
Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is inherently many-to-many: one audio segment can be described by multiple captions and vice versa. To address this, we propose Probabilistic Language-Audio Pre-training (ProLAP), which models this multiplicity as the spread of probability distributions in a joint language-audio embedding space. To learn intra-modal hierarchical relationships effectively, we also introduce two objectives: (i) a hierarchical inclusion loss that promotes semantic hierarchical understanding of inputs and (ii) a mask repulsive loss that improves learning efficiency when optimizing the hierarchical inclusion loss. With this training strategy, our model can learn the hierarchical structure inherent in the data even from small datasets, in contrast to prior probabilistic approaches that rely on large-scale datasets. In our experiments, ProLAP outperforms existing deterministic approaches on audio-text retrieval tasks. Moreover, through experiments on the audio traversal task introduced in this paper, we demonstrate that ProLAP captures a plausible semantic hierarchy.
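The abstract does not spell out how "spread of probability distributions" can encode hierarchy, so here is a minimal illustrative sketch, assuming diagonal-Gaussian embeddings and a KL-divergence-based inclusion penalty; the actual ProLAP losses and embedding parameterization may differ, and all names and numbers below are hypothetical:

```python
import math

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between diagonal Gaussians, given per-dimension means and variances."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def inclusion_loss(specific, general):
    """Asymmetric penalty: low when the 'specific' distribution lies inside
    the broader 'general' one, high otherwise. Each argument is (mu, var)."""
    (mu_s, var_s), (mu_g, var_g) = specific, general
    return kl_diag_gaussian(mu_s, var_s, mu_g, var_g)

# Hypothetical toy embeddings: a broad caption ("an animal makes a sound")
# gets a wide distribution; a specific caption ("a dog barks twice") gets
# a narrow distribution nested inside it.
general = ([0.0, 0.0], [4.0, 4.0])
specific = ([0.5, -0.3], [0.25, 0.25])

print(inclusion_loss(specific, general))  # small: specific fits inside general
print(inclusion_loss(general, specific))  # large: general does not fit inside specific
```

Because KL divergence is asymmetric, swapping the arguments yields a much larger value, which is what lets a single objective impose a direction on the hierarchy (specific ⊂ general) rather than mere proximity.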