🤖 AI Summary
This work addresses information allocation ambiguity and semantic leakage in cross-modal generalization, both of which arise from the lack of structural inductive bias in symmetric architectures. To this end, we propose an asymmetric hierarchical anchoring framework that leverages the hierarchical discrete representations derived from audio residual vector quantization (RVQ) as structured semantic anchors, guiding video feature distillation into a shared semantic space. The framework incorporates a gradient reversal layer (GRL) for adversarial disentanglement and employs local sliding alignment (LSA) to enhance fine-grained temporal consistency. Compared with conventional mutual information estimators, the GRL-based approach more effectively suppresses semantic leakage in the modality-specific branches. Extensive experiments on the AVE and AVVP benchmarks demonstrate significant improvements over symmetric baselines, and talking-face disentanglement studies further validate the semantic coherence and disentanglement of the learned representations.
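To make the anchoring idea concrete, below is a minimal sketch of the kind of RVQ hierarchy the summary refers to. It assumes a three-stage codebook with straight-through gradients; the paper's actual codebook sizes, depth, commitment losses, and module names are not specified here, so everything in this block is illustrative.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Toy residual vector quantizer: each stage quantizes the residual
    left by the previous stage, yielding a coarse-to-fine hierarchy of
    discrete codes that can serve as structured semantic anchors."""

    def __init__(self, dim=256, codebook_size=512, num_stages=3):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_stages)]
        )

    def forward(self, x):                        # x: (B, T, D) audio features
        residual, quantized, codes = x, torch.zeros_like(x), []
        for codebook in self.codebooks:
            # Nearest-neighbour lookup against this stage's codebook.
            flat = residual.reshape(-1, residual.size(-1))       # (B*T, D)
            idx = torch.cdist(flat, codebook.weight).argmin(-1)  # (B*T,)
            idx = idx.reshape(residual.shape[:-1])               # (B, T)
            q = codebook(idx)
            # Straight-through estimator: gradients pass to the encoder.
            q = residual + (q - residual).detach()
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        return quantized, codes                  # codes[0] = coarsest level
```

In the anchoring view, the coarse stage-0 codes would act as the shared semantic anchors toward which video features are distilled, while deeper stages absorb residual, increasingly modality-specific detail.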
📝 Abstract
Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with an adversarial decoupler based on a gradient reversal layer (GRL), which explicitly suppresses semantic leakage in modality-specific branches, and we introduce Local Sliding Alignment (LSA) to encourage fine-grained temporal alignment across modalities. Extensive experiments on the AVE and AVVP benchmarks demonstrate that AHA consistently outperforms symmetric baselines in cross-modal transfer. Additional analyses on a talking-face disentanglement experiment further validate that the learned representations exhibit improved semantic consistency and disentanglement, indicating the broader applicability of the proposed framework.
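The gradient reversal mechanism itself is standard (Ganin & Lempitsky, 2015), so the following sketch shows how a GRL-based decoupler of the kind described above could be wired up. The discriminator head, the λ coefficient, and the use of anchor codes as adversarial targets are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on the
    backward pass, so the upstream encoder learns to *fool* the head."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class AdversarialDecoupler(nn.Module):
    """Hypothetical decoupler head: it tries to predict the shared
    semantic anchor code from the modality-specific features; through
    the GRL, those features are optimized to make the prediction fail,
    pushing shared semantics out of the modality-specific branch."""

    def __init__(self, feat_dim=256, num_codes=512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_codes),
        )

    def forward(self, specific_feat, anchor_codes, lam=1.0):
        # specific_feat: (B, T, D); anchor_codes: (B, T) discrete anchors
        logits = self.classifier(grad_reverse(specific_feat, lam))
        return F.cross_entropy(logits.flatten(0, 1), anchor_codes.flatten())
```

Minimizing this loss trains the classifier head, while the reversed gradient drives the modality-specific encoder to discard anchor-predictable (i.e., shared) information, which is the leakage-suppression role the abstract attributes to the GRL and which avoids the instability of mutual information estimation.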