AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing adapter-based vision-language model (VLM) fine-tuning methods suffer from cross-layer parameter redundancy and limited representational capacity due to homogeneous adapter architectures. To address these limitations, we propose AdaRing—the first efficient fine-tuning framework integrating cross-layer tensor ring decomposition with heterogeneous adapter co-training. Its core contributions are: (1) modeling cross-layer adapter parameters via tensor ring decomposition to jointly learn a shared core and layer-specific slices, thereby enforcing low-rank structural consistency across layers; and (2) introducing a diversity-aware, rank-adaptive heterogeneous adapter optimization mechanism that enhances both representation learning and generalization. Evaluated on multiple vision-language benchmarks, AdaRing achieves state-of-the-art performance using only 10% of the trainable parameters required by prior methods—a 90% reduction in trainable parameters—and significantly advances lightweight VLM adaptation.
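The summary's key idea—factoring the stack of per-layer adapter weights into shared cores plus small layer-specific slices—can be sketched with a tensor ring contraction. This is an illustrative reconstruction, not the authors' code; all names, dimensions, and ranks below are assumptions chosen for clarity.

```python
# Minimal sketch of cross-layer tensor ring decomposition of adapters.
# The L adapter weight matrices W_l (d_in x d_out) are never stored
# directly; instead they are reconstructed from two shared 3-way cores
# (G_in, G_out) and a tiny layer-specific slice S_l per layer:
#   W_l[i, j] = trace(S_l @ G_in[:, i, :] @ G_out[:, j, :])
import numpy as np

L, d_in, d_out = 12, 64, 64        # number of layers, adapter dims (hypothetical)
r0, r1, r2 = 4, 4, 4               # tensor ring ranks (hypothetical)

rng = np.random.default_rng(0)
slices = rng.standard_normal((L, r0, r1)) * 0.1    # layer-specific slices S_l
G_in = rng.standard_normal((r1, d_in, r2)) * 0.1   # shared input core
G_out = rng.standard_normal((r2, d_out, r0)) * 0.1 # shared output core

# Ring contraction over all three bond indices (a, b, c) at once:
W = np.einsum("lab,bic,cja->lij", slices, G_in, G_out)

trainable = slices.size + G_in.size + G_out.size   # 2240 parameters
full = L * d_in * d_out                            # 49152 for L separate adapters
print(W.shape, trainable, full)
```

Because the two large cores are shared across all layers, the per-layer cost is only the `r0 * r1` slice, which is where the cross-layer redundancy savings come from.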

📝 Abstract
Adapter-based fine-tuning has gained remarkable attention as a way to efficiently adapt large pre-trained vision-language models (VLMs) to a wide range of downstream tasks. In this paradigm, only the inserted adapters are fine-tuned, without the need for training the original VLM backbone. Existing works scale adapters by integrating them into every layer of VLMs to increase adapter capacity. However, these methods face two primary limitations: 1) limited compression rate due to ignoring cross-layer redundancy, and 2) limited representational capacity across homogeneous adapters. In this paper, we propose a novel vision-language fine-tuning framework based on cross-layer tensor ring decomposition (TRD) with the integration and collaboration of diverse adapters, called AdaRing, achieving ultra-light parameter-efficient adaptation of VLMs on various tasks. To remove the high redundancy that exists among adapters across layers, we exploit tensor-level low-rankness to formulate adapters as layer-shared tensor cores and layer-specific slices. Moreover, guided by generalization-aware fine-tuning, diverse rank-driven adapters cooperate to handle tasks that require different representations. Our experiments show that the proposed AdaRing achieves state-of-the-art performance while reducing average training parameters by 90%.
Problem

Research questions and friction points this paper is trying to address.

Reducing cross-layer redundancy in adapter-based fine-tuning
Enhancing representational capacity across homogeneous adapters
Achieving ultra-light parameter-efficient adaptation for VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-layer tensor ring decomposition for redundancy removal
Layer-shared tensor cores with layer-specific slices
Generalization-aware diverse rank-driven adapters collaboration