🤖 AI Summary
To address challenges in multimodal perception, including the difficulty of joint audio-video-text modeling, weak cross-modal alignment, and poor task generalization, this paper introduces the PE-AV family of audiovisual Perception Encoders. We propose a multi-granularity (segment-level plus frame-level) contrastive learning framework with ten jointly optimized pairwise contrastive objectives; move beyond single-domain audio constraints by natively aligning speech, music, sound effects, video, and text; and build a data engine that synthesizes high-quality captions for O(100M) audio-video pairs. Leveraging scaled contrastive learning and our PE-A-Frame mechanism for fine-grained frame-level alignment, we establish a unified cross-modal embedding space. Our approach sets new state-of-the-art results on standard audio and video benchmarks, substantially improves zero-shot transfer, and enables new tasks such as sound event detection and cross-modal speech retrieval.
📝 Abstract
We introduce Perception Encoder Audiovisual (PE-AV), a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV extends representations to audio and natively supports joint embeddings across the audio-video, audio-text, and video-text modality pairs. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision that is consistent across modalities. Our audio data spans speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling the number of cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
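The abstract describes training with multiple pairwise contrastive objectives over modality pairs. The paper's exact loss is not given here, so as a rough illustration only, one such term is commonly implemented as a CLIP-style symmetric InfoNCE loss over a batch of paired embeddings; the function names, the fixed temperature, and the NumPy implementation below are all assumptions for the sketch, not the authors' code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """One pairwise objective (e.g. audio-text): symmetric InfoNCE.

    emb_a, emb_b: (N, D) arrays where row i of each is a positive pair.
    Hypothetical sketch; the paper's actual loss and temperature may differ.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    targets = np.arange(len(a))             # matching pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average the a->b and b->a directions, as in CLIP-style training.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In the setup the abstract describes, ten such terms (one per cross-modality and caption-type pair) would be combined, e.g. summed or averaged, into the total training objective; the frame-level objective in PE-A-Frame would apply the same idea per audio frame rather than per segment.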