HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

📅 2025-10-23
🤖 AI Summary
Multi-modal large language models (MLLMs) are expensive to train, typically requiring thousands of GPUs, in part because widely used vision encoders (e.g., CLIP, SAM) are not aligned with language at multiple granularity levels. To address this, the paper proposes HyperET, an MLLM training paradigm built on hyperbolic geometry, in which visual representations are aligned with their textual counterparts at an arbitrary granularity level by dynamically adjusting their hyperbolic radius. HyperET applies Möbius multiplication with parameter-efficient learnable matrices in three configurations (diagonal scaling, block-diagonal, and banded) to model hierarchical multimodal structure in hyperbolic space. Experiments across multiple multimodal models and benchmarks (e.g., LLaVA, MMStar, MMMU) show that HyperET consistently improves both pre-training and fine-tuning of existing MLLMs while adding less than 1% additional parameters.

📝 Abstract
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels. To address this issue, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which optimizes visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves both existing pre-training and fine-tuning MLLMs with less than 1% additional parameters.
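
The abstract does not spell out the Möbius multiplication it relies on. As a rough illustration only, the sketch below assumes the Poincaré ball model with curvature -c and the Möbius matrix-vector product commonly used in hyperbolic neural networks; the function names, the curvature value, and the diagonal scales are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
import numpy as np

def mobius_matvec(M, x, c=1.0, eps=1e-9):
    """Mobius matrix-vector product on the Poincare ball of curvature -c.

    Assumed (standard) definition, not confirmed as the paper's own:
    M (x)_c x = (1/sqrt(c)) * tanh(|Mx|/|x| * artanh(sqrt(c)|x|)) * Mx/|Mx|
    """
    sqrt_c = np.sqrt(c)
    x_norm = max(float(np.linalg.norm(x)), eps)
    Mx = M @ x
    Mx_norm = float(np.linalg.norm(Mx))
    if Mx_norm < eps:
        # By convention the product is the origin when Mx vanishes.
        return np.zeros_like(Mx)
    # Clip the artanh argument so points near the boundary stay finite.
    arg = min(sqrt_c * x_norm, 1.0 - 1e-7)
    new_norm = np.tanh(Mx_norm / x_norm * np.arctanh(arg)) / sqrt_c
    return new_norm * Mx / Mx_norm

def hyperbolic_radius(x, c=1.0):
    """Geodesic distance from the origin of the Poincare ball to x."""
    sqrt_c = np.sqrt(c)
    arg = min(sqrt_c * float(np.linalg.norm(x)), 1.0 - 1e-7)
    return 2.0 / sqrt_c * np.arctanh(arg)

# Toy "visual feature" inside the unit ball and a diagonal scaling matrix.
x = np.array([0.1, 0.2, 0.0, 0.1])
scales = np.array([2.0, 2.0, 0.5, 0.5])  # hypothetical learnable values
y = mobius_matvec(np.diag(scales), x)

print("radius before:", hyperbolic_radius(x))
print("radius after: ", hyperbolic_radius(y))
```

Under these assumptions the output always stays inside the ball, and changing the diagonal entries moves the point closer to or farther from the origin, which is one way to read "hyperbolic radius adjustment".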
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient cross-modal alignment in multi-modal large language models
Leverages hyperbolic space to bridge visual-textual granularity gaps
Optimizes visual representations with dynamic hyperbolic radius adjustment
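
To make "hyperbolic radius adjustment" concrete, here is a short worked identity. It assumes the Poincaré ball model of curvature $-c$ and the standard Möbius matrix-vector product from the hyperbolic neural network literature; the symbols $r$, $M$, $s$, and $c$ are introduced here for illustration and are not taken from the paper.

```latex
% Hyperbolic radius: geodesic distance from the origin of the Poincare ball
r(x) = d_c(0, x) = \frac{2}{\sqrt{c}} \operatorname{artanh}\!\left(\sqrt{c}\,\lVert x \rVert\right)

% Mobius matrix-vector product (for Mx \neq 0)
M \otimes_c x = \frac{1}{\sqrt{c}}
  \tanh\!\left( \frac{\lVert Mx \rVert}{\lVert x \rVert}
  \operatorname{artanh}\!\left(\sqrt{c}\,\lVert x \rVert\right) \right)
  \frac{Mx}{\lVert Mx \rVert}

% Special case: uniform diagonal scaling M = sI with s > 0
\lVert (sI) \otimes_c x \rVert
  = \frac{1}{\sqrt{c}} \tanh\!\left( s \operatorname{artanh}\!\left(\sqrt{c}\,\lVert x \rVert\right) \right)
\quad\Longrightarrow\quad
r\big((sI) \otimes_c x\big) = s \, r(x)
```

In this special case a single scalar rescales the hyperbolic radius linearly while leaving the direction of the point unchanged; non-uniform diagonal, block-diagonal, or banded matrices would change the direction as well.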
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyperbolic space bridges visual-textual granularity gaps
Dynamic hyperbolic radius adjusts alignment granularity
Learnable diagonal, block-diagonal, and banded matrices enable an efficient parametrization strategy
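
The three matrix configurations named in the abstract can be compared by how many learnable entries each needs. The sketch below is an illustration under assumed settings: the dimension `d`, block size `b`, and bandwidth `k` are hypothetical toy values, and the helper names are not from the paper.

```python
import numpy as np

def diagonal_matrix(scales):
    """d learnable entries: one scale per feature dimension."""
    return np.diag(scales)

def block_diagonal_matrix(blocks):
    """blocks has shape (n, b, b); the result is (n*b, n*b) with n*b*b learnable entries."""
    n, b, _ = blocks.shape
    M = np.zeros((n * b, n * b))
    for i in range(n):
        M[i * b:(i + 1) * b, i * b:(i + 1) * b] = blocks[i]
    return M

def band_mask(d, k):
    """Boolean mask keeping entries within k positions of the main diagonal."""
    idx = np.arange(d)
    return np.abs(idx[:, None] - idx[None, :]) <= k

d, b, k = 8, 2, 1  # hypothetical toy sizes
rng = np.random.default_rng(0)

M_diag = diagonal_matrix(rng.normal(size=d))
M_block = block_diagonal_matrix(rng.normal(size=(d // b, b, b)))
M_band = rng.normal(size=(d, d)) * band_mask(d, k)  # zero outside the band

counts = {
    "dense": d * d,
    "diagonal": d,
    "block_diagonal": (d // b) * b * b,
    "banded": int(band_mask(d, k).sum()),
}
print(counts)
```

For these toy sizes the counts are 8 (diagonal), 16 (block-diagonal), and 22 (banded) against 64 for a dense matrix. The gap widens quickly with `d`, since the structured variants grow linearly in `d` for fixed `b` and `k` while the dense matrix grows quadratically.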
Authors

Zelin Peng (Shanghai Jiao Tong University; Computer Vision, Medical Image Processing)
Zhengqin Xu (State Key Laboratory of Infrared Physics, Shanghai Institute of Technical Physics, CAS)
Qingyang Liu (MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, SJTU)
Xiaokang Yang (MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, SJTU)
Wei Shen (MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, SJTU)