Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual representation learning methods rely heavily on manually annotated labels for cross-modal alignment and struggle to capture implicit semantic associations. To address this, we propose an unsupervised/weakly supervised learning framework. Our method introduces: (1) a novel progressive self-distillation mechanism that dynamically discovers label-agnostic deep audio-visual correlations via distribution-driven soft alignment; and (2) a cross-modal triplet loss that jointly optimizes feature-space geometry and alignment probability distributions. Evaluated on multiple audio-visual retrieval and matching benchmarks, our approach achieves significant performance gains, particularly under low-label-rate regimes, demonstrating superior embedding robustness and generalization. This work establishes a new paradigm for weakly supervised multimodal representation learning.
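The two pieces of the summary can be sketched concretely. Below is a minimal, hypothetical illustration of distribution-driven soft alignment and the progressive blending of hard labels with self-distilled targets; the function names (`soft_alignment`, `progressive_targets`), the temperature value, and the linear blending schedule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def soft_alignment(audio_emb, visual_emb, temperature=0.1):
    """Distribution-driven soft alignment (illustrative sketch):
    a softmax over cosine similarities between audio and visual
    embeddings yields a probabilistic alignment per audio sample."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = (a @ v.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def progressive_targets(labels_onehot, soft_align, alpha):
    """Progressive self-distillation (assumed schedule): blend hard
    label targets with the model's own soft alignments; alpha grows
    over training, shifting weight toward self-distilled knowledge."""
    return (1.0 - alpha) * labels_onehot + alpha * soft_align
```

With `alpha = 0`, training reduces to purely label-guided alignment; as `alpha` increases, the distribution-based knowledge distilled from the model itself progressively takes over, which is the mechanism the summary describes for discovering label-agnostic correlations.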

📝 Abstract
Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments (probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels). Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used t…
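The cross-modal triplet loss named in the abstract follows the standard triplet formulation, with the anchor drawn from one modality and the positive/negative from the other. The sketch below is a generic margin-based version under that assumption; the margin value and Euclidean distance choice are illustrative, not confirmed details of the paper.

```python
import numpy as np

def cross_modal_triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (illustrative): pull the cross-modal
    positive toward the anchor while pushing the negative at least
    `margin` farther away. Shapes: (batch, dim) for all three inputs."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-negative distance
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

In the audio-visual setting, `anchor` would be an audio embedding and `positive`/`negative` visual embeddings (or vice versa), so minimizing this loss shapes the shared embedding geometry across modalities.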
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Learning
Label Dependency
Feature Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Loss
Self-paced Learning
Soft Alignment
Donghuo Zeng
KDDI Research, Inc.
Audio-visual learning · Causality
Kazushi Ikeda
KDDI Research, Inc., Saitama, Japan