Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges posed by weakly paired, unlabeled audio-visual corpora, such as the availability of only pre-extracted features, multi-event segments, and spurious co-occurrences. It proposes HSC-MAE, a dual-path teacher-student framework that, for the first time in an unsupervised setting, jointly models three forms of semantic correspondence: global canonical correlation, local neighborhood semantics, and sample-wise conditional sufficiency. The method integrates Deep Canonical Correlation Analysis (DCCA) for global alignment, leverages teacher-mined soft top-k affinities to capture local relationships, and enhances sample discriminability via masked autoencoding. It further incorporates an exponential moving average (EMA) teacher, a soft top-k InfoNCE loss, learnable multi-task weights, and a geometric distillation mechanism. Evaluated on the AVE and VEGAS datasets, the approach substantially outperforms existing unsupervised baselines, achieving significant gains in mAP and demonstrating that the learned representations are robust and well structured.
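The EMA teacher mentioned in the summary is a standard construction: after each student update, the teacher's parameters are moved a small step toward the student's, so the teacher changes slowly and supplies stable targets. A minimal sketch, using a plain dict of floats as a stand-in for network weights (the parameter dict and the momentum value are illustrative assumptions, not taken from the paper):

```python
def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average: teacher <- m * teacher + (1 - m) * student.

    teacher, student: dicts mapping parameter names to floats (toy stand-ins
    for real network tensors). Updates `teacher` in place and returns it.
    """
    for name, s_val in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_val
    return teacher
```

Because the teacher drifts slowly behind the student, the soft affinities it mines are more stable training targets than ones computed from the rapidly updating student itself.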
📝 Abstract
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and observed co-occurrences can be spurious. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and an affinity-weighted soft top-k InfoNCE loss; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile the competing objectives, and an optional distillation loss transfers the teacher's geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
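The affinity-weighted soft top-k InfoNCE can be sketched as follows. This is a hedged reconstruction of the general idea, not the paper's implementation: the temperature `tau`, the top-k size, and the use of pairwise-similarity matrices are assumptions. For each anchor, the teacher's top-k most similar candidates (softmax-normalized) act as soft positive targets for the student's InfoNCE distribution, replacing the usual single hard positive:

```python
import numpy as np

def soft_topk_infonce(student_sim, teacher_sim, k=2, tau=0.1):
    """Soft top-k InfoNCE (sketch, assumed formulation).

    student_sim, teacher_sim: (N, N) cross-modal similarity matrices from the
    student and the EMA teacher. Returns the mean weighted cross-entropy
    between the student's softmax over candidates and the teacher's
    top-k soft-positive weights.
    """
    n = student_sim.shape[0]
    # Student log-distribution over all N candidates (log-softmax for stability).
    logits = student_sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    loss = 0.0
    for i in range(n):
        topk = np.argsort(teacher_sim[i])[-k:]        # teacher-mined neighbors
        w = np.exp(teacher_sim[i, topk] / tau)
        w = w / w.sum()                               # soft positive weights
        loss -= (w * log_p[i, topk]).sum()            # weighted cross-entropy
    return loss / n
```

When the teacher and student similarity structures agree, the loss is near zero; when the student's similarities contradict the teacher's neighborhoods, it grows, pulling the student toward the teacher's multi-positive relational structure.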
Problem

Research questions and friction points this paper is trying to address.

audio-visual representation learning
unsupervised learning
multimodal embedding
weakly paired data
semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Autoencoder
Semantic Correlation
Canonical Correlation Analysis
Teacher-Student Framework
Unsupervised Multimodal Learning