🤖 AI Summary
This work addresses the instability of extremely low-rank (e.g., rank-1) parameter-efficient fine-tuning in multimodal large language models, which often arises from a mismatch between the update direction and the geometric structure of pre-trained features. The authors identify that the anisotropic distributions of visual and textual pre-trained features cause early gradient flow to be dominated by directions along the “modality gap.” To mitigate this, they propose Gap-Init, a geometry-aware directional initialization method that aligns the rank-1 LoRA update direction with this gap while preserving zero initial updates. Experiments across multiple vision–language tasks and backbone architectures demonstrate that Gap-Init substantially improves the stability of rank-1 fine-tuning, achieving performance comparable to or even surpassing that of rank-8 baselines, thereby highlighting that proper alignment of the initial update direction is as critical as increasing rank.
📝 Abstract
Parameter-efficient fine-tuning (PEFT) is a standard way to adapt multimodal large language models, yet extremely low-rank settings -- especially rank-1 LoRA -- are often unstable. We show that this instability is not solely due to limited capacity: in the rank-1 regime, optimization is highly sensitive to the update direction. Concretely, pretrained vision and text features form mismatched anisotropic regions, yielding a dominant "gap" direction that acts like a translation component and disproportionately steers early gradients under rank-1 constraints. Analyzing pretrained representations, we identify a modality-gap axis that dominates early gradient flow, while a random rank-1 initialization is unlikely to align with it, leading to weak gradients and training collapse. We propose Gap-Init, a geometry-aware initialization that aligns the rank-1 LoRA direction with an estimated modality-gap vector from a small calibration set, while keeping the initial LoRA update zero. Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines. Our results suggest that at the extreme low-rank limit, initial alignment can matter as much as rank itself.
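The core mechanics described in the abstract (estimate a modality-gap vector from a small calibration set, align the rank-1 LoRA direction with it, and keep the initial update zero) can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: the feature dimensions, calibration-set size, and the choice of placing the gap direction in the `A` factor are all assumptions for the sake of the example.

```python
import numpy as np

# Illustrative sketch of the Gap-Init idea (all shapes/choices are assumptions):
# 1) estimate a "modality gap" direction from a small calibration set of
#    vision and text features,
# 2) initialize a rank-1 LoRA adapter so its input factor A is aligned with
#    that direction while B = 0, keeping the initial update B @ A zero.

rng = np.random.default_rng(0)
d = 512          # feature dimension (illustrative)
n_calib = 64     # small calibration set size (illustrative)

# Placeholder calibration features; in practice these would be extracted
# from the pretrained vision and text encoders.
vision_feats = rng.normal(loc=1.0, size=(n_calib, d))
text_feats = rng.normal(loc=0.0, size=(n_calib, d))

# Modality-gap vector: difference of the per-modality feature means,
# normalized to unit length.
gap = vision_feats.mean(axis=0) - text_feats.mean(axis=0)
gap = gap / np.linalg.norm(gap)

# Rank-1 LoRA update W + B @ A, with A in R^{1 x d} and B in R^{d x 1}.
A = gap[None, :]        # aligned with the estimated gap direction
B = np.zeros((d, 1))    # zero factor, so the initial update B @ A is zero

delta_W = B @ A         # exactly zero at initialization
```

Keeping `B = 0` mirrors the standard LoRA convention that the adapter contributes nothing at step zero; the directional information lives entirely in `A`, so early gradients through `B` are immediately shaped by the gap-aligned direction.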