Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio captioning faces two key challenges: exposure bias induced by teacher-forcing training and inaccurate acoustic–linguistic alignment due to the neglect of temporal structure in existing cross-modal contrastive methods. To address these, we propose a bias-free stochastic decoding framework. Its core contributions are: (1) the first differentiable, unbiased sliced Wasserstein RBF kernel with error convergence rate O(L⁻¹/²), explicitly modeling temporal alignment across modalities; and (2) integration of rotary position encoding with Monte Carlo estimation to enhance cross-modal similarity measurement under contrastive learning. Evaluated on AudioCaps and Clotho, our method significantly improves generated caption length, lexical diversity, and text–audio self-retrieval accuracy—demonstrating superior alignment and generalization without exposure bias.

📝 Abstract
Teacher-forcing training for audio captioning usually leads to exposure bias due to the mismatch between training and inference. Prior works propose contrastive methods to deal with caption degeneration. However, these methods ignore temporal information when measuring similarity across the acoustic and linguistic modalities, leading to inferior performance. In this work, we develop a temporal-similarity score by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel equipped with rotary positional embedding to account for temporal information across modalities. In contrast to the conventional sliced Wasserstein RBF kernel, the USW-RBF kernel admits an unbiased Monte Carlo estimate. It is therefore well-suited to stochastic gradient optimization algorithms, and its approximation error decreases at a parametric rate of $\mathcal{O}(L^{-1/2})$ with $L$ Monte Carlo samples. Additionally, we introduce an audio captioning framework based on the unbiased sliced Wasserstein kernel, incorporating stochastic decoding methods to mitigate caption degeneration during the generation process. We conduct extensive quantitative and qualitative experiments on two datasets, AudioCaps and Clotho, to demonstrate the capability of generating high-quality audio captions. Experimental results show that our framework increases caption length, lexical diversity, and text-to-audio self-retrieval accuracy.
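The Monte Carlo construction in the abstract can be illustrated with a minimal NumPy sketch: project both sequences of embeddings onto random unit directions, compute a 1-D Wasserstein distance per direction via quantiles, and average an RBF transform over the L projections. The function name, the quantile discretization, and the `gamma` bandwidth below are illustrative assumptions, not the paper's exact USW-RBF formulation.

```python
import numpy as np

def sliced_wasserstein_rbf(X, Y, num_projections=128, gamma=1.0, rng=None):
    """Monte Carlo sketch of a sliced Wasserstein RBF kernel.

    X, Y: (n, d) and (m, d) point clouds, e.g. per-timestep audio and
    text embeddings. Each random direction theta yields a 1-D
    Wasserstein-2 distance between the projected empirical measures;
    the kernel averages exp(-gamma * W2^2) over L directions, giving
    an unbiased estimate of the expectation over theta.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    thetas = rng.normal(size=(num_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    qs = np.linspace(0.0, 1.0, 100)  # shared quantile grid
    vals = []
    for theta in thetas:
        # 1-D Wasserstein-2 via quantile functions of the projections
        qx = np.quantile(X @ theta, qs)
        qy = np.quantile(Y @ theta, qs)
        w2_sq = np.mean((qx - qy) ** 2)
        vals.append(np.exp(-gamma * w2_sq))
    return float(np.mean(vals))
```

Because each projection contributes an independent sample, the estimator's error shrinks at the $\mathcal{O}(L^{-1/2})$ rate stated above, and the whole computation is differentiable in the embeddings, which is what makes it compatible with stochastic gradient training.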
Problem

Research questions and friction points this paper is trying to address.

Addresses exposure bias in teacher-forcing training
Improves temporal information in similarity measurement
Enhances audio captioning quality and diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unbiased sliced Wasserstein RBF kernel
Rotary positional embedding integration
Stochastic decoding for caption generation
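The rotary-embedding ingredient listed above can be sketched generically: each position rotates matched feature pairs by position-dependent angles, so inner products between time steps depend on their relative offset. This is a standard RoPE-style implementation assumed for illustration (the pairing convention uses two half-blocks); the paper's exact integration with the kernel may differ.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position encoding to a sequence of embeddings.

    x: (seq_len, dim) with even dim. Features are split into two
    halves; pair i = (x1[i], x2[i]) is rotated by an angle that grows
    with position and shrinks with frequency index, so the rotation
    preserves norms and encodes relative position in dot products.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)
```

Since each per-pair map is a plain 2-D rotation, the encoding leaves embedding norms unchanged while making the similarity score sensitive to temporal order, which is the property the temporal-similarity score relies on.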