🤖 AI Summary
This work addresses information allocation ambiguity and semantic leakage in cross-modal generalization, both of which arise from the lack of structural inductive bias in symmetric architectures. To this end, we propose an asymmetric hierarchical anchoring framework that leverages the hierarchical discrete representations derived from audio residual vector quantization (RVQ) as structured semantic anchors, guiding video feature distillation into a shared semantic space. The framework incorporates a gradient reversal layer (GRL) for adversarial disentanglement and employs local sliding alignment (LSA) to enhance fine-grained temporal consistency. Compared with conventional mutual information estimators, the GRL-based approach more effectively suppresses semantic leakage in the modality-specific branches. Extensive experiments on the AVE and AVVP benchmarks demonstrate significant improvements over symmetric baselines, and talking-face disentanglement studies further validate the semantic coherence and disentanglement of the learned representations.
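To make the anchoring idea concrete, below is a minimal sketch of the kind of RVQ hierarchy the summary refers to. It assumes a three-stage codebook with straight-through gradients; the paper's actual codebook sizes, depth, commitment losses, and module names are not specified here, so everything in this block is illustrative.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Toy residual vector quantizer: each stage quantizes the residual
    left by the previous stage, yielding a coarse-to-fine hierarchy of
    discrete codes that can serve as structured semantic anchors."""

    def __init__(self, dim=256, codebook_size=512, num_stages=3):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_stages)]
        )

    def forward(self, x):                        # x: (B, T, D) audio features
        residual, quantized, codes = x, torch.zeros_like(x), []
        for codebook in self.codebooks:
            # Nearest-neighbour lookup against this stage's codebook.
            flat = residual.reshape(-1, residual.size(-1))       # (B*T, D)
            idx = torch.cdist(flat, codebook.weight).argmin(-1)  # (B*T,)
            idx = idx.reshape(residual.shape[:-1])               # (B, T)
            q = codebook(idx)
            # Straight-through estimator: gradients pass to the encoder.
            q = residual + (q - residual).detach()
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        return quantized, codes                  # codes[0] = coarsest level
```

In the anchoring view, the coarse stage-0 codes would act as the shared semantic anchors toward which video features are distilled, while deeper stages absorb residual, increasingly modality-specific detail.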
📝 Abstract
Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with an adversarial decoupler based on a gradient reversal layer (GRL), which explicitly suppresses semantic leakage in modality-specific branches, and we introduce Local Sliding Alignment (LSA) to encourage fine-grained temporal alignment across modalities. Extensive experiments on the AVE and AVVP benchmarks demonstrate that AHA consistently outperforms symmetric baselines in cross-modal transfer. Additional analyses on a talking-face disentanglement experiment further validate that the learned representations exhibit improved semantic consistency and disentanglement, indicating the broader applicability of the proposed framework.
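The gradient reversal mechanism itself is standard (Ganin & Lempitsky, 2015), so the following sketch shows how a GRL-based decoupler of the kind described above could be wired up. The discriminator head, the λ coefficient, and the use of anchor codes as adversarial targets are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on the
    backward pass, so the upstream encoder learns to *fool* the head."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class AdversarialDecoupler(nn.Module):
    """Hypothetical decoupler head: it tries to predict the shared
    semantic anchor code from the modality-specific features; through
    the GRL, those features are optimized to make the prediction fail,
    pushing shared semantics out of the modality-specific branch."""

    def __init__(self, feat_dim=256, num_codes=512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_codes),
        )

    def forward(self, specific_feat, anchor_codes, lam=1.0):
        # specific_feat: (B, T, D); anchor_codes: (B, T) discrete anchors
        logits = self.classifier(grad_reverse(specific_feat, lam))
        return F.cross_entropy(logits.flatten(0, 1), anchor_codes.flatten())
```

Minimizing this loss trains the classifier head, while the reversed gradient drives the modality-specific encoder to discard anchor-predictable (i.e., shared) information, which is the leakage-suppression role the abstract attributes to the GRL and which avoids the instability of mutual information estimation.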