Enhancing Video Music Recommendation with Transformer-Driven Audio-Visual Embeddings

πŸ“… 2024-09-30
πŸ›οΈ 2024 IEEE 5th International Symposium on the Internet of Sounds (IS2)
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the reliance on manual annotations and challenges in cross-modal temporal alignment in automatic video background music generation, this paper proposes TIVM, a self-supervised cross-modal audio-video matching framework. TIVM employs a dual-stream Transformer encoder to independently model the temporal structures of audio and video modalities, and leverages InfoNCE contrastive learning to achieve fine-grained cross-modal alignment within a shared embedding spaceβ€”without requiring any human-annotated supervision. Crucially, it pioneers the integration of Transformers for joint audio-video representation learning, substantially enhancing temporal semantic consistency. Extensive experiments on multiple benchmark datasets demonstrate that TIVM outperforms state-of-the-art methods by a significant 12.6% improvement in Recall@10, validating its effectiveness and generalizability for unsupervised cross-modal matching.

Technology Category

Application Category

πŸ“ Abstract
A fitting soundtrack can help a video better convey its content and provide a better immersive experience. This paper introduces a novel approach utilizing self-supervised learning and contrastive learning to automatically recommend audio for video content, thereby eliminating the need for manual labeling. We use a dual-branch cross-modal embedding model that maps both audio and video features into a common low-dimensional space. The fit of various audio-video pairs can then be mod-eled as inverse distance measure. In addition, a comparative analysis of various temporal encoding methods is presented, emphasizing the effectiveness of transformers in managing the temporal information of audio-video matching tasks. Through multiple experiments, we demonstrate that our model TIVM, which integrates transformer encoders and using InfoN Celoss, significantly improves the performance of audio-video matching and surpasses traditional methods.
Problem

Research questions and friction points this paper is trying to address.

Automatically recommend audio for video content using self-supervised learning.
Map audio and video features into a common low-dimensional space.
Improve audio-video matching performance with transformer encoders.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning for audio-video embedding
Dual-branch cross-modal embedding model
Transformer encoders for temporal information management
πŸ”Ž Similar Papers
No similar papers found.