ProLAP: Probabilistic Language-Audio Pre-Training

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing language-audio joint representation learning predominantly relies on deterministic embeddings, implicitly assuming a one-to-one semantic mapping—yet real-world semantics inherently exhibit many-to-many relationships. This work proposes a probabilistic joint representation learning framework for language and audio, employing distributional embeddings to explicitly model semantic multiplicity. It introduces a hierarchical inclusion loss to capture semantic hierarchy and a mask repulsive loss to make optimizing the inclusion loss more efficient, enabling effective training even on small datasets. To the authors' knowledge, this is the first approach to incorporate probabilistic modeling into language-audio pre-training. Experiments demonstrate consistent improvements over deterministic baselines on audio-text retrieval. Moreover, the authors design a novel audio traversal task to empirically validate the framework's ability to encode semantic hierarchies, showing that the learned representations support structured, interpretable traversal between coarse and fine-grained semantics.

📝 Abstract
Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is inherently many-to-many: one audio segment can be described by multiple captions and vice versa. To address this, we propose Probabilistic Language-Audio Pre-training (ProLAP), which models this multiplicity as the spread of probability distributions in a joint language-audio embedding space. To learn intra-modal hierarchical relationships effectively, we also introduce two objectives: (i) a hierarchical inclusion loss to promote semantic hierarchical understanding of inputs and (ii) a mask repulsive loss to improve learning efficiency when optimizing the hierarchical inclusion loss. With this training strategy, our model can learn the hierarchical structure inherent in the data even from small datasets, in contrast to prior probabilistic approaches that rely on large-scale datasets. In our experiments, ProLAP outperforms existing deterministic approaches on audio-text retrieval tasks. Moreover, through experiments on the audio traversal task introduced in this paper, we demonstrate that ProLAP captures a plausible semantic hierarchy.
Problem

Research questions and friction points this paper is trying to address.

Modeling many-to-many relationships between audio and text
Learning hierarchical structures from small datasets effectively
Improving audio-text retrieval with probabilistic embedding distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models probabilistic distributions in joint embedding space
Introduces hierarchical inclusion loss for semantic understanding
Uses mask repulsive loss to improve learning efficiency
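The inclusion idea above can be sketched concretely. A minimal illustration, assuming each embedding is a diagonal Gaussian (mean plus log-variance) and that "inclusion" is encouraged by an asymmetric KL penalty pushing a specific concept's distribution to nest inside a broader one; the paper's actual loss formulation may differ, and the function names here (`inclusion_loss`, `kl_diag_gaussians`) are hypothetical:

```python
import numpy as np

def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    # Closed-form KL(p || q) between diagonal Gaussians, summed over dims.
    var_p, var_q = np.exp(logvar_p), np.exp(logvar_q)
    return 0.5 * np.sum(
        logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def inclusion_loss(mu_child, logvar_child, mu_parent, logvar_parent):
    # Asymmetric penalty: the more specific ("child") distribution should be
    # covered by the broader ("parent") one, so penalize KL(child || parent).
    return kl_diag_gaussians(mu_child, logvar_child, mu_parent, logvar_parent)

rng = np.random.default_rng(0)
d = 8
mu_parent = np.zeros(d)
logvar_parent = np.full(d, 1.0)       # broad, generic concept
mu_inside = 0.1 * rng.standard_normal(d)
logvar_narrow = np.full(d, -1.0)      # narrow, specific concept
mu_outside = mu_inside + 5.0          # same narrowness, far from the parent

loss_inside = inclusion_loss(mu_inside, logvar_narrow, mu_parent, logvar_parent)
loss_outside = inclusion_loss(mu_outside, logvar_narrow, mu_parent, logvar_parent)
assert loss_outside > loss_inside     # nesting inside the parent is cheaper
```

Under this sketch, a caption like "a dog barking in a park" would get a narrow distribution nested inside the broader distribution for "animal sounds", which is the kind of structure the audio traversal task probes.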