🤖 AI Summary
To address the challenge of modeling fine-grained audio-visual interactions in active speaker detection (ASD) under unconstrained settings, where conventional late fusion falls short, we propose GateFusion, an architecture that pairs strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate adaptively injects contextual features from one modality into the other at multiple depths of the Transformer backbone, guided by learnable, bimodally-conditioned gates, enabling deep cross-modal collaboration. We further introduce two auxiliary objectives, Masked Alignment Loss (MAL) and Over-Positive Penalty (OPP), which jointly suppress spurious visual activations and enhance modality consistency. Evaluated on Ego4D-ASD, UniTalk, and WASD, GateFusion achieves state-of-the-art mAP scores of 77.8%, 86.1%, and 96.1%, respectively, and demonstrates superior cross-domain generalization.
📝 Abstract
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
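The core idea of gated injection, a gate conditioned on both modalities deciding how much of one stream's context to add into the other, can be sketched as follows. This is a minimal illustration with hypothetical parameter names (`W_gate`, `W_proj`) and a single fusion step, not the paper's exact layer:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # illustrative feature dimension

# Hypothetical parameters of one gated injection step
W_gate = rng.normal(scale=0.1, size=(D, 2 * D))  # gate conditioned on both modalities
W_proj = rng.normal(scale=0.1, size=(D, D))      # projects audio context before injection

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_inject(visual, audio):
    """Inject audio context into the visual stream via a
    bimodally-conditioned gate (a sketch of the idea only)."""
    gate = sigmoid(W_gate @ np.concatenate([visual, audio]))  # in (0, 1), shape (D,)
    return visual + gate * (W_proj @ audio)

v = rng.normal(size=D)  # visual token at one layer
a = rng.normal(size=D)  # audio context at the same depth
fused = gated_inject(v, a)
```

In the hierarchical variant described above, a step like this would be applied at multiple Transformer depths, with each layer learning its own gate.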