🤖 AI Summary
To address challenges in multimodal perception, including the difficulty of joint audio-video-text modeling, weak cross-modal alignment, and poor task generalization, this paper introduces the PE-AV family of audiovisual Perception Encoders. We propose a multi-granularity (segment-level plus frame-level) contrastive learning framework with ten jointly optimized pairwise contrastive objectives; move beyond single-domain audio constraints by natively aligning speech, music, sound effects, video, and text; and build a data engine that synthesizes high-quality captions for O(100M) audio-video pairs. Leveraging scaled contrastive learning and our PE-A-Frame mechanism for fine-grained frame-level alignment, we establish a unified cross-modal embedding space. Our approach sets new state-of-the-art results on standard audio and video benchmarks, substantially improves zero-shot transfer, and enables new tasks such as sound event detection and cross-modal speech retrieval.
📝 Abstract
We introduce Perception Encoder Audiovisual (PE-AV), a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV extends representations to audio and natively supports joint embeddings across the audio-video, audio-text, and video-text modality pairs. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision that is consistent across modalities. Our audio data spans speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling the number of cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
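The abstract describes training with multiple pairwise contrastive objectives over modality pairs. The paper's exact loss is not given here, so as a rough illustration only, one such term is commonly implemented as a CLIP-style symmetric InfoNCE loss over a batch of paired embeddings; the function names, the fixed temperature, and the NumPy implementation below are all assumptions for the sketch, not the authors' code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """One pairwise objective (e.g. audio-text): symmetric InfoNCE.

    emb_a, emb_b: (N, D) arrays where row i of each is a positive pair.
    Hypothetical sketch; the paper's actual loss and temperature may differ.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    targets = np.arange(len(a))             # matching pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average the a->b and b->a directions, as in CLIP-style training.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In the setup the abstract describes, ten such terms (one per cross-modality and caption-type pair) would be combined, e.g. summed or averaged, into the total training objective; the frame-level objective in PE-A-Frame would apply the same idea per audio frame rather than per segment.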