Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Date: 2025-10-31
Citations: 0
Influential: 0
AI Summary
Existing vision-language models extract hierarchical features from text but represent each image with a single feature, leading to asymmetric modality alignment. Method: We propose Alignment across Trees, a framework that (i) constructs tree-structured hierarchical features for both images and text; (ii) embeds the two feature trees into hyperbolic manifolds with distinct curvatures, defines a KL-divergence-based distance between distributions on these heterogeneous manifolds, and learns an intermediate manifold by minimizing that distance, with a proof that the optimal intermediate manifold exists and is unique; and (iii) extracts coarse-to-fine visual features with a semantic-aware visual encoder that applies text-guided cross-attention to class tokens from intermediate Transformer layers. Contribution/Results: The method consistently outperforms strong baselines on taxonomic open-set classification across multiple image datasets under few-shot and cross-domain settings.

Abstract
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
Problem

Research questions and friction points this paper is trying to address.

Aligns hierarchical image-text features across asymmetric modalities
Models feature trees on heterogeneous hyperbolic manifolds with distinct curvatures
Learns intermediate manifold for cross-manifold alignment using KL divergence
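
To make the role of curvature concrete, the following is a minimal sketch, not the paper's implementation. It assumes the Poincare ball model of hyperbolic space (the abstract does not say which model is used), and the names `expmap0`, `poincare_dist`, `C_IMAGE`, and `C_TEXT`, along with the toy feature values, are hypothetical. It maps the same Euclidean parent and child features onto two balls with different curvatures and prints the resulting geodesic distances, which shows why distances measured on the two manifolds are not directly comparable.

```python
import math

def _sq_norm(u):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in u)

def expmap0(v, c):
    """Exponential map at the origin of a Poincare ball with curvature -c (c > 0)."""
    norm = math.sqrt(_sq_norm(v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_dist(x, y, c):
    """Geodesic distance between two points on the same Poincare ball (curvature -c)."""
    diff = _sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * _sq_norm(x)) * (1.0 - c * _sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

# Hypothetical toy tree: one coarse (parent) and one fine (child) feature.
parent, child = [0.2, 0.1], [0.9, 0.6]
C_IMAGE, C_TEXT = 0.5, 2.0  # assumed curvature magnitudes, one per modality

for name, c in (("image", C_IMAGE), ("text", C_TEXT)):
    p, q = expmap0(parent, c), expmap0(child, c)
    print(name, round(poincare_dist(p, q, c), 4))
```

In this convention the distance from the origin to `expmap0(v, c)` equals twice the Euclidean norm of `v` for any `c`, which is a quick sanity check on both functions.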
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-like hierarchical features for both image and text
Semantic-aware visual extraction using cross-attention mechanism
Manifold alignment via KL distance on hyperbolic spaces
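
The text-guided extraction idea can be illustrated with a small sketch. This is an assumption-laden simplification rather than the authors' architecture: it uses single-head scaled dot-product attention with no learned projections, and the function name `text_guided_attention` and the toy vectors are hypothetical. A text feature acts as the query, and the class tokens taken from several intermediate Transformer layers act as keys and values, so each text cue pools the layers whose tokens match it best.

```python
import math

def text_guided_attention(text_query, class_tokens):
    """Pool per-layer class tokens with softmax weights driven by a text query.

    text_query:   list of d floats (one text feature, e.g. a coarse label).
    class_tokens: list of L lists of d floats (one class token per layer).
    Returns (pooled_feature, attention_weights).
    """
    d = len(text_query)
    # Scaled dot-product score between the text query and each layer's token.
    scores = [
        sum(q * k for q, k in zip(text_query, token)) / math.sqrt(d)
        for token in class_tokens
    ]
    # Numerically stable softmax over layers.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the class tokens (they serve as the values).
    pooled = [
        sum(w * token[i] for w, token in zip(weights, class_tokens))
        for i in range(d)
    ]
    return pooled, weights

# Hypothetical class tokens from three intermediate layers (d = 3).
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coarse_text = [2.0, 0.0, 0.0]  # toy query that matches the first layer's token

pooled, weights = text_guided_attention(coarse_text, tokens)
print([round(w, 3) for w in weights])
```

Running the function once per text level (coarse to fine) would give one visual feature per level, which is one way to obtain a visual feature tree that mirrors the text tree.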
Authors
Wu Wei
Southern University of Science and Technology
lithium ion battery, anode, silicon
Xiaomeng Fan
Beijing Institute of Technology
machine learning, computer vision
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Robotics, Trajectory Optimization, Task and Motion Planning
Zhi Gao
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Pengxiang Li
Beijing Institute of Technology
Multimodal Agent, Vision and Language, 3DV, Hyperbolic Learning
Yunde Jia
Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Mehrtash Harandi
Department of Electrical and Computer Systems Engineering, Monash University
Machine Learning, Computer Vision