MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes MMLoP, a low-rank multi-modal prompting framework that addresses the parameter inefficiency of existing vision-language prompting methods, which introduce large numbers of trainable parameters and thereby undermine the efficiency advantage that makes prompt tuning attractive. By parameterizing layer-wise prompts in both the visual and textual encoders through a low-rank decomposition, the method requires only 11.5K trainable parameters, roughly one-thousandth of comparable deep multi-modal approaches. The framework further integrates a self-regulating consistency loss, a uniform drift correction, and a cross-modal shared up-projection. Experiments across three benchmarks and eleven datasets show a highly favorable accuracy-efficiency tradeoff, including a harmonic mean accuracy of 79.70% on base-to-novel generalization, outperforming most existing methods, including those with orders of magnitude more parameters.

📝 Abstract
Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization.
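The low-rank parameterization and drift correction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exact ranks, shapes, and the precise form of the shared up-projection and drift correction are not specified here, so the coupling below (per-layer, per-modality down factors multiplied by one shared up-projection, and a mean-shift subtraction against frozen zero-shot features) is an assumption. For simplicity both encoders use a single embedding width, although real CLIP vision and text widths differ.

```python
import torch
import torch.nn as nn


class LowRankPrompts(nn.Module):
    """Layer-wise prompts reconstructed as P = down @ up, with the
    up-projection shared across modalities (assumed form of the
    cross-modal coupling)."""

    def __init__(self, n_layers=12, n_tokens=4, rank=2, dim=512):
        super().__init__()
        # Per-layer, per-modality down factors: (L, n_tokens, rank).
        self.down_v = nn.Parameter(0.02 * torch.randn(n_layers, n_tokens, rank))
        self.down_t = nn.Parameter(0.02 * torch.randn(n_layers, n_tokens, rank))
        # One shared up-projection: (rank, dim). Sharing it keeps the
        # trainable count tiny and ties the two modalities together.
        self.up = nn.Parameter(0.02 * torch.randn(rank, dim))

    def forward(self):
        # Full-width prompts are rebuilt on the fly; only the low-rank
        # factors are trainable.
        return self.down_v @ self.up, self.down_t @ self.up


def uniform_drift_correction(prompted, zero_shot):
    """Subtract the mean shift between prompted and frozen zero-shot
    features (one plausible reading of 'uniform drift correction')."""
    drift = (prompted - zero_shot).mean(dim=0, keepdim=True)
    return prompted - drift
```

With these toy settings the module holds 96 + 96 + 1,024 ≈ 1.2K parameters, versus roughly 49K for full-width prompts of the same depth and length in both modalities, which mirrors the order-of-magnitude savings the paper reports.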
Problem

Research questions and friction points this paper is trying to address.

prompt learning
vision-language adaptation
parameter efficiency
multi-modal prompting
low-rank
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-rank prompting
multi-modal adaptation
parameter-efficient tuning
cross-modal alignment
vision-language models