Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

📅 2025-10-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing vision-language models represent text hierarchically but encode each image as a single flat feature, which makes modality alignment asymmetric and suboptimal. Method: The paper proposes Alignment across Trees, a framework that (i) builds tree-structured hierarchical features for both images and text, using a semantic-aware visual encoder in which text-guided cross-attention over class tokens from intermediate Transformer layers yields coarse-to-fine visual features; (ii) embeds the two feature trees in hyperbolic manifolds with distinct curvatures; and (iii) aligns these heterogeneous manifolds by defining a KL-divergence-based distance between distributions on them and learning an intermediate manifold that minimizes it, with a proof that the optimal intermediate manifold exists and is unique. Results: On taxonomic open-set classification across multiple image datasets, the method consistently outperforms strong baselines in few-shot and cross-domain settings.

📝 Abstract

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
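
To make the text-guided extraction step more concrete, here is a minimal NumPy sketch of cross-attention in which text features act as queries over class tokens taken from intermediate Transformer layers. It is an illustration under stated assumptions, not the paper's implementation: the function name `text_guided_pooling`, the shapes, the random inputs, and the absence of learned projection matrices are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_pooling(cls_tokens, text_queries):
    """Hypothetical helper: one visual feature per hierarchy level.

    cls_tokens:   (L, d) class tokens from L intermediate layers.
    text_queries: (K, d) text features for K levels, coarse to fine.
    Returns (K, d) visual features and (K, L) attention weights.
    """
    d = cls_tokens.shape[-1]
    # Scaled dot-product scores between each text level and each layer token.
    scores = text_queries @ cls_tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Each level's feature is a weighted mix of the layer-wise class tokens.
    return weights @ cls_tokens, weights

rng = np.random.default_rng(0)
cls_tokens = rng.normal(size=(12, 64))    # assumed: 12 layers, width 64
text_queries = rng.normal(size=(3, 64))   # assumed: 3 taxonomy levels
visual_tree, attn = text_guided_pooling(cls_tokens, text_queries)
print(visual_tree.shape, attn.shape)
```

Each row of `attn` shows how strongly one text level draws on each layer, which is one simple way a coarse level and a fine level could end up with different visual features from the same image.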

Problem

Research questions and friction points this paper is trying to address.

Aligns hierarchical image-text features across asymmetric modalities

Models feature trees on heterogeneous hyperbolic manifolds with distinct curvatures

Learns intermediate manifold for cross-manifold alignment using KL divergence
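
As a rough illustration of what embedding features in hyperbolic manifolds with distinct curvatures can look like, the sketch below uses the Poincaré ball model with curvature -c. The summary does not say which hyperbolic model the paper uses, so the ball model, the curvature values `c_text` and `c_img`, and the random features are assumptions made for illustration only.

```python
import numpy as np

def expmap0(v, c):
    """Exponential map at the origin of the Poincare ball with curvature -c (c > 0)."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-12)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x, y, c):
    """Mobius addition on the Poincare ball with curvature -c."""
    xy = np.sum(x * y, axis=-1, keepdims=True)
    x2 = np.sum(x * x, axis=-1, keepdims=True)
    y2 = np.sum(y * y, axis=-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def poincare_dist(x, y, c):
    """Geodesic distance between points of the ball with curvature -c."""
    sqrt_c = np.sqrt(c)
    diff = np.linalg.norm(mobius_add(-x, y, c), axis=-1)
    return 2.0 / sqrt_c * np.arctanh(np.clip(sqrt_c * diff, 0.0, 1 - 1e-7))

rng = np.random.default_rng(0)
c_text, c_img = 1.0, 0.5                     # assumed curvature magnitudes
text_feat = 0.1 * rng.normal(size=(3, 8))    # assumed Euclidean features
img_feat = 0.1 * rng.normal(size=(3, 8))

text_pts = expmap0(text_feat, c_text)        # text tree nodes on ball 1
img_pts = expmap0(img_feat, c_img)           # image tree nodes on ball 2
print(poincare_dist(text_pts[0], text_pts[1], c_text))
print(poincare_dist(img_pts[0], img_pts[1], c_img))
```

Distances are only defined between points on the same ball, which is why two manifolds with different curvatures need an additional alignment step before image and text nodes can be compared.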

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-like hierarchical features for both image and text

Semantic-aware visual extraction using cross-attention mechanism

Manifold alignment via KL distance on hyperbolic spaces
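
The idea of comparing the two trees on a shared intermediate manifold can be sketched as follows. This is not the paper's KL distance, whose exact form is not given in this summary. As a stand-in, the sketch moves points between Poincaré balls through the tangent space at the origin, turns pairwise distances into row-wise softmax distributions, and averages a discrete KL divergence. The intermediate curvature `c_mid` is fixed by hand here, whereas the paper learns it; every function name and value is a hypothetical choice.

```python
import numpy as np

def expmap0(v, c):
    s = np.sqrt(c)
    n = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-12)
    return np.tanh(s * n) * v / (s * n)

def logmap0(x, c):
    s = np.sqrt(c)
    n = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)
    return np.arctanh(np.clip(s * n, 0.0, 1 - 1e-7)) * x / (s * n)

def mobius_add(x, y, c):
    xy = np.sum(x * y, axis=-1, keepdims=True)
    x2 = np.sum(x * x, axis=-1, keepdims=True)
    y2 = np.sum(y * y, axis=-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def pairwise_dist(x, c):
    """All-pairs geodesic distances between rows of x: (n, d) -> (n, n)."""
    diff = mobius_add(-x[:, None, :], x[None, :, :], c)
    s = np.sqrt(c)
    n = np.linalg.norm(diff, axis=-1)
    return 2.0 / s * np.arctanh(np.clip(s * n, 0.0, 1 - 1e-7))

def to_curvature(x, c_from, c_to):
    """Assumed transfer rule: log map at the origin, then exp map on the target ball."""
    return expmap0(logmap0(x, c_from), c_to)

def neighbor_distribution(x, c):
    """Row-wise softmax over negative distances: one distribution per node."""
    logits = -pairwise_dist(x, c)
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean row-wise KL(p || q) for discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean())

rng = np.random.default_rng(0)
c_text, c_img, c_mid = 1.0, 0.5, 0.75        # assumed curvature magnitudes
text_nodes = expmap0(0.3 * rng.normal(size=(4, 8)), c_text)
image_nodes = expmap0(0.3 * rng.normal(size=(4, 8)), c_img)

# Bring both trees onto the intermediate ball, then compare their distributions.
p_text = neighbor_distribution(to_curvature(text_nodes, c_text, c_mid), c_mid)
p_img = neighbor_distribution(to_curvature(image_nodes, c_img, c_mid), c_mid)
gap = kl_divergence(p_text, p_img)
print(gap)
```

In a training loop, a quantity like `gap` would be minimized with respect to both the encoders and the intermediate curvature; the grid-free, fixed-`c_mid` version above only shows how the pieces fit together.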
