TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

πŸ“… 2025-05-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Lightweight vision-language models (VLMs) suffer from a cross-modal alignment bottleneck due to the limited representational capacity of their language encoders. Method: This paper is the first to attribute this bottleneck to insufficient effective mutual information (EMI) between modalities. We propose TinyAlignβ€”a retrieval-augmented generation (RAG)-based framework that constructs an updatable multimodal memory bank and employs a lightweight connector to enable dynamic contextual injection and alignment optimization. The method integrates EMI-theoretic analysis, efficient memory retrieval, and parameter-efficient fine-tuning. Contribution/Results: TinyAlign substantially reduces training loss and accelerates convergence. It achieves baseline performance using only 40% of the fine-tuning data, consistently improving alignment quality and generalization across multiple downstream tasks. The approach significantly enhances data efficiency and adaptability to resource-constrained settings, without increasing model size or inference latency.
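The summary describes TinyAlign's core loop as retrieving relevant entries from a multimodal memory bank to enrich the model's input. The paper does not give implementation details here, so the following is only a minimal sketch of the retrieval step under common assumptions (cosine similarity over fused image-text embeddings, top-k nearest neighbors); the function name, toy memory bank, and embedding dimensionality are all illustrative, not from the paper.

```python
import numpy as np

def retrieve_context(query_emb, memory_embs, memory_texts, k=2):
    """Return the k memory entries most similar to the query embedding.

    Hypothetical sketch of RAG-style context retrieval: embeddings stand in
    for fused image-text features stored in TinyAlign's memory bank.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q
    # Indices of the k highest-similarity entries.
    top = np.argsort(-sims)[:k]
    return [memory_texts[i] for i in top]

# Toy memory bank (2-D embeddings for illustration only).
bank = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
texts = ["a red bus", "a snowy peak", "a bus near mountains"]
print(retrieve_context(np.array([0.9, 0.1]), bank, texts, k=2))
# β†’ ['a red bus', 'a bus near mountains']
```

The retrieved texts would then be injected into the multimodal input by the lightweight connector before generation; how that injection is performed is specific to the paper's architecture and not sketched here.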

πŸ“ Abstract
Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight models with limited representational capacity. In this work, we investigate this alignment bottleneck through the lens of mutual information, demonstrating that the constrained capacity of the language model inherently limits the Effective Mutual Information (EMI) between multimodal inputs and outputs, thereby compromising alignment quality. To address this challenge, we propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank to enrich multimodal inputs and enhance their alignment. Extensive empirical evaluations reveal that TinyAlign significantly reduces training loss, accelerates convergence, and enhances task performance. Remarkably, it allows models to achieve baseline-level performance with only 40% of the fine-tuning data, highlighting exceptional data efficiency. Our work thus offers a practical pathway for developing more capable lightweight VLMs while introducing a fresh theoretical lens to better understand and address alignment bottlenecks in constrained multimodal systems.
Problem

Research questions and friction points this paper is trying to address.

Lightweight VLMs face modal alignment bottlenecks due to limited representational capacity.
Constrained language models limit Effective Mutual Information in multimodal alignment.
TinyAlign enhances alignment by retrieving context to enrich multimodal inputs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses mutual information to analyze alignment bottlenecks
Introduces TinyAlign with retrieval-augmented context enrichment
Achieves high data efficiency with 40% fine-tuning data
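The mutual-information framing above can be made concrete with the standard definition; note that the paper's exact Effective Mutual Information (EMI) formulation may differ, so this is only the textbook quantity it builds on:

```latex
% Mutual information between multimodal input X and target output Y.
I(X;Y) = \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right]

% Enriching the input with retrieved context R cannot reduce the
% information available about Y (chain rule for mutual information):
I(X,R;\,Y) = I(X;Y) + I(R;Y \mid X) \;\ge\; I(X;Y)
```

This inequality gives the intuition behind retrieval-augmented enrichment: appending relevant retrieved context can only add information about the target, which matters most when a small language model cannot extract all of it from the raw multimodal input alone.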
Yuanze Hu
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University
Zhaoxin Fan
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; Hangzhou International Innovation Institute, Beihang University
Xinyu Wang
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University
Gen Li
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University
Ye Qiu
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University
Zhichao Yang
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University
Wenjun Wu
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; Hangzhou International Innovation Institute, Beihang University
Kejian Wu
XREAL Inc; PhD, University of Minnesota
Research areas: SLAM, Estimation, Computer Vision, Augmented Reality, Artificial Intelligence
Yifan Sun
Xreal
Xiaotie Deng
Chair Professor of Computer Science, Peking University, Beijing, China
Research areas: Algorithmic Game Theory, Approximate Computing, Parallel Computing, Combinatorial Optimization
Jin Dong
Peking University