AI Summary
This work addresses the challenges of spurious negative samples and missing cross-modal semantic associations in audio-visual embedding learning caused by sparse annotations. To mitigate these issues, we propose a novel learning framework that leverages soft-label prediction and an implicit interaction graph. Our approach employs a teacher-student architecture to generate reliable soft supervision signals and utilizes the GRaSP algorithm to construct a directed inter-class dependency graph. By incorporating graph-guided regularization and semantic alignment losses, the model effectively captures latent semantic dependencies among unannotated co-occurring events. Experiments on the AVE and VEGAS benchmarks demonstrate that the proposed method significantly improves mean average precision (mAP), enhancing both semantic consistency and robustness in cross-modal embeddings.
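The summary describes a teacher that emits soft-label distributions in both modalities and an alignment loss that keeps them consistent. As a minimal sketch (the abstract does not give the exact AV-SAL formulation, so the symmetrized KL form, function name, and example distributions below are illustrative assumptions):

```python
import math

def soft_label_alignment(p_audio, p_visual, eps=1e-8):
    """Symmetrized KL divergence between the audio and visual soft-label
    distributions -- a simple stand-in for an AV-SAL-style alignment loss.
    `eps` guards against log(0) for zero-probability classes."""
    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl(p_audio, p_visual) + kl(p_visual, p_audio))

# Hypothetical two-class distributions over {"train", "motorcycle"}:
# agreement across modalities costs nothing, while a teacher that puts
# mass on the unannotated co-occurring class in only one modality is
# penalized until the other modality's distribution agrees.
aligned    = soft_label_alignment([0.8, 0.2], [0.8, 0.2])
misaligned = soft_label_alignment([0.8, 0.2], [0.2, 0.8])
```

Under this form, minimizing the loss drives both modalities to place nonzero probability on the same co-occurring events, which is the enriched supervision signal the summary refers to.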
Abstract
Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences: background noise, unrelated scene elements, or unannotated events. Most contrastive and triplet-loss methods rely on sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" may also contain motorcycle audio and visuals that go unannotated because "motorcycle" is not the chosen label; standard methods then treat such clips as negatives for true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL): a teacher network is trained to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI): the GRaSP algorithm is applied to the teacher's soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" -> "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns rather than ground-truth causal links. (3) Latent Interaction Regularizer (LIR): a student network is trained with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (mAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
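The LIR component can be sketched as a weighted graph penalty: for each directed edge in the inferred graph, the embeddings of the linked classes are pulled together in proportion to the edge's soft-label-derived weight. The function name, edge representation, and weight value below are illustrative assumptions, not the paper's exact formulation:

```python
def latent_interaction_reg(embeddings, edges):
    """Graph-guided regularizer sketch: sum of squared distances between
    embeddings of dependency-linked classes, each scaled by the edge
    weight inferred from the teacher's soft labels."""
    loss = 0.0
    for (src, dst), w in edges.items():
        loss += w * sum((a - b) ** 2 for a, b in zip(embeddings[src], embeddings[dst]))
    return loss

# Toy 2-D embeddings for two class nodes and one inferred edge
# "train_visual" -> "motorcycle_audio" with weight 0.3 (hypothetical).
emb = {
    "train_visual":     [1.0, 0.0],
    "motorcycle_audio": [0.6, 0.8],
}
reg = latent_interaction_reg(emb, {("train_visual", "motorcycle_audio"): 0.3})
# squared distance 0.4^2 + 0.8^2 = 0.8, scaled by 0.3 -> 0.24
```

In training, this term would be added to the student's metric loss, so gradient descent shrinks the distance between dependency-linked embeddings most strongly where the inferred edge weight (soft-label probability) is highest.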