Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the ambiguity in information allocation and semantic leakage arising from the lack of inductive bias in symmetric architectures for cross-modal generalization. To this end, we propose an asymmetric hierarchical anchoring framework that leverages hierarchical discrete representations derived from audio residual vector quantization (RVQ) as structured semantic anchors to guide video feature distillation into a shared semantic space. The framework incorporates a gradient reversal layer (GRL) for adversarial disentanglement and employs local sliding alignment (LSA) to enhance fine-grained temporal consistency. Compared to conventional mutual information estimators, our approach effectively suppresses semantic leakage in modality-specific branches. Extensive experiments on the AVE and AVVP benchmarks demonstrate significant improvements over symmetric baselines, and speaker face disentanglement studies further validate the semantic coherence and disentanglement capability of the learned representations.
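To make the anchoring idea concrete, the following is a minimal, illustrative sketch of residual vector quantization (not the authors' implementation): each codebook stage quantizes the residual left by the previous stage, so early stages carry coarse semantics and later stages refine detail, which is the hierarchy the framework uses as semantic anchors. The codebooks and input here are toy values chosen for illustration.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each codebook quantizes the residual
    left by the previous stage, yielding a coarse-to-fine code hierarchy."""
    residual = x.astype(float).copy()
    quantized = np.zeros_like(residual)
    codes = []
    for cb in codebooks:                               # cb: (K, D) codebook
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)                         # nearest code per row
        q = cb[idx]
        codes.append(idx)
        quantized += q
        residual -= q                                  # hand residual onward
    return codes, quantized

# Toy example: stage 1 captures the coarse value, stage 2 its residual.
cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])
cb2 = np.array([[-0.1, 0.1], [0.0, 0.0]])
x = np.array([[0.9, 1.1]])
codes, q = rvq_encode(x, [cb1, cb2])
```

With these toy codebooks the two stages together reconstruct the input exactly; in practice the stage-wise code indices are what serve as the discrete hierarchical anchors.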

📝 Abstract
Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with a GRL-based adversarial decoupler that explicitly suppresses semantic leakage in modality-specific branches, and introduce Local Sliding Alignment (LSA) to encourage fine-grained temporal alignment across modalities. Extensive experiments on the AVE and AVVP benchmarks demonstrate that AHA consistently outperforms symmetric baselines in cross-modal transfer. Additional analyses on a talking-face disentanglement experiment further validate that the learned representations exhibit improved semantic consistency and disentanglement, indicating the broader applicability of the proposed framework.
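The GRL-based adversarial decoupling mentioned above can be illustrated with a toy worked gradient (an assumption-laden sketch, not the paper's code): a gradient reversal layer is the identity in the forward pass, but negates (and scales by a coefficient lambda) the gradient flowing back into the encoder, so the encoder is trained to defeat a modality classifier while the classifier itself still descends its own loss. The linear encoder, sigmoid classifier, and random values here are purely illustrative.

```python
import numpy as np

# Toy linear encoder z = W x feeding a modality classifier p = sigmoid(v.z).
# A gradient reversal layer (GRL) sits between them: forward is identity,
# backward multiplies the gradient by -lambda before it reaches the encoder.
rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # input features
W = rng.normal(size=(3, 4))                  # encoder weights
v = rng.normal(size=3)                       # classifier weights
y = 1.0                                      # modality label

z = W @ x                                    # encoder output (shared feature)
p = 1.0 / (1.0 + np.exp(-v @ z))             # classifier probability
grad_z = (p - y) * v                         # dL/dz for binary cross-entropy
grad_W_plain = np.outer(grad_z, x)           # encoder gradient without GRL
lam = 1.0
grad_W_grl = np.outer(-lam * grad_z, x)      # GRL: sign-flipped encoder gradient
```

Because the encoder steps along the negated gradient, it is pushed to maximize the modality classifier's loss, stripping modality-identifying cues from the shared representation while the classifier continues ordinary descent.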
Problem

Research questions and friction points this paper is trying to address.

Cross-Modal Generalization
Audio-Visual Joint Representation
Information Allocation Ambiguity
Semantic Leakage
Modality Disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Hierarchical Anchoring
Cross-Modal Generalization
Residual Vector Quantization
Adversarial Decoupling
Local Sliding Alignment
Bixing Wu
Zhejiang University, China
Yuhong Zhao
Zhejiang University, China
Zongli Ye
Zhejiang University, China; MMLab, Chinese University of Hong Kong, China
Jiachen Lian
UC Berkeley
precision healthcare · speech processing · machine learning
Xiangyu Yue
The Chinese University of Hong Kong / UC Berkeley / Stanford University / NJU
Artificial Intelligence · Computer Vision · Multi-modal Learning
G. Anumanchipalli
University of California, Berkeley, USA