MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing prompt-learning methods for multimodal visual recognition struggle to model cross-modal relationships and incur high computational overhead, particularly in missing-modality scenarios caused by privacy constraints, data-acquisition limitations, or resource scarcity. Method: We propose a parameter-efficient fine-tuning framework that introduces modality-shared low-rank adapters to enable bidirectional knowledge transfer between the vision and language modalities, preserving modality-specific adaptation while enhancing cross-modal interaction. The method jointly optimizes modality-specific and shared low-rank parameters. Contribution/Results: Our approach tunes only 0.11% of the full model's parameters and reduces inference latency to 25.90% of that of the state-of-the-art method. On standard benchmarks, it achieves an average accuracy improvement of 5.24%, substantially improving robustness and practicality in incomplete-modality settings.
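The core idea, a frozen weight plus one modality-shared and one modality-specific low-rank update, can be sketched in a few lines of PyTorch. Everything below (the `MoRALinear` name, the rank, the initialization) is an illustrative assumption rather than the paper's implementation:

```python
# Minimal sketch of a modality-shared + modality-specific low-rank adapter.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class MoRALinear(nn.Module):
    """A frozen linear layer augmented with two LoRA-style updates:
    one pair of factors shared across modalities (cross-modal transfer)
    and one pair per modality (modality-specific adaptation)."""

    def __init__(self, base: nn.Linear, rank: int = 4,
                 modalities=("vision", "text")):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the backbone weight stays frozen

        d_out, d_in = base.weight.shape
        # Shared factors: reused by both encoders, so gradients from either
        # modality update them (bidirectional knowledge transfer).
        self.shared_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.shared_B = nn.Parameter(torch.zeros(d_out, rank))
        # Specific factors: one pair per modality, preserving
        # intra-modality flexibility.
        self.spec_A = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(rank, d_in) * 0.01)
             for m in modalities})
        self.spec_B = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(d_out, rank))
             for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        shared = x @ self.shared_A.T @ self.shared_B.T
        specific = x @ self.spec_A[modality].T @ self.spec_B[modality].T
        return self.base(x) + shared + specific
```

Initializing the B factors to zero, as in standard LoRA, makes the adapter a no-op at the start of training, so fine-tuning begins from the pre-trained model's behavior.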

📝 Abstract
Pre-trained vision-language models have shown remarkable performance on visual recognition tasks, but they typically assume the availability of complete multimodal inputs during both training and inference. In real-world scenarios, however, modalities may be missing due to privacy constraints, collection difficulties, or resource limitations. While previous approaches have addressed this challenge with prompt-learning techniques, they fail to capture the cross-modal relationships necessary for effective multimodal visual recognition and incur unavoidable computational overhead. In this paper, we introduce MoRA, a parameter-efficient fine-tuning method that explicitly models cross-modal interactions while maintaining modality-specific adaptations. MoRA introduces modality-common parameters between the text and vision encoders, enabling bidirectional knowledge transfer. Combined with the modality-specific parameters, these allow the backbone model to maintain inter-modality interaction while retaining intra-modality flexibility. Extensive experiments on standard benchmarks demonstrate that MoRA achieves an average performance improvement of 5.24% in missing-modality scenarios, uses only 25.90% of the inference time of the SOTA method, and requires only 0.11% of the trainable parameters of full fine-tuning.
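Because each adapter branch is selected by a modality argument, inference with incomplete inputs can simply skip the absent branch. The sketch below builds on the `MoRALinear` class above; the skip-and-average fallback is an assumption for illustration, not the procedure described in the paper:

```python
# Purely illustrative: inference when one modality is missing, reusing the
# MoRALinear sketch above. Averaging over whatever is present is an
# assumption, not the paper's procedure.
from typing import Optional

import torch


@torch.no_grad()
def adapted_features(layer: "MoRALinear",
                     vision_feat: Optional[torch.Tensor] = None,
                     text_feat: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Apply the adapted layer to whichever modality features exist
    and average the results, so a missing modality is simply skipped."""
    outs = []
    if vision_feat is not None:
        outs.append(layer(vision_feat, modality="vision"))
    if text_feat is not None:
        outs.append(layer(text_feat, modality="text"))
    if not outs:
        raise ValueError("at least one modality must be provided")
    return torch.stack(outs).mean(dim=0)
```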
Problem

Research questions and friction points this paper is trying to address.

Addresses missing modalities in vision-language models during inference
Enables cross-modal knowledge transfer while maintaining modality-specific adaptations
Reduces computational overhead while improving missing-modality scenario performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-efficient fine-tuning with low-rank adaptation (see the sketch after this list)
Explicitly models cross-modal interactions bidirectionally
Combines modality-common and modality-specific parameters
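To make the parameter-efficiency claim concrete, the snippet below counts trainable parameters in the `MoRALinear` sketch above. The dimensions and the resulting fraction are illustrative assumptions; over a full vision-language backbone, where only the small adapters train, the fraction would be far smaller:

```python
# Hypothetical usage of the MoRALinear sketch above: only the low-rank
# factors are trainable. Dimensions are assumed for illustration, not taken
# from the paper's configuration.
import torch
import torch.nn as nn

base = nn.Linear(768, 768)
layer = MoRALinear(base, rank=4)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} = {trainable / total:.2%}")

# Both branches run through the same layer; the shared factors receive
# gradients from vision and text alike during training.
v = layer(torch.randn(2, 768), modality="vision")
t = layer(torch.randn(2, 768), modality="text")
```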
👥 Authors

Shu Zhao
The Pennsylvania State University

Nilesh Ahuja
Intel
Probabilistic methods in machine learning, anomaly detection, computer vision, image and video processing, signal processing

Tan Yu
NVIDIA
LLM, RAG, cross-modal search, advertising, vision backbone

Tianyi Shen
The Pennsylvania State University

Vijay Narayanan
The Pennsylvania State University