🤖 AI Summary
This work proposes a sparse fine-tuning strategy based on layer-importance assessment to reduce the number of trainable parameters in parameter-efficient fine-tuning of large language models. Using similarity metrics such as Centered Kernel Alignment (CKA) to analyze representational changes across layers, the method identifies the layers that contribute most to downstream tasks and applies LoRA or its variants exclusively to them. The approach is orthogonal to, and compatible with, existing LoRA techniques, substantially reducing the number of trainable parameters. Experimental results across diverse benchmarks (GLUE, mathematical reasoning, code generation, and multimodal tasks) demonstrate up to a 50% reduction in trainable parameters with negligible performance degradation, and in some cases slight improvements.
📝 Abstract
Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. Enhancing their predictive performance on downstream tasks, in turn, typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. Compared to full fine-tuning, these methods achieve over a 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune with LoRA or its variants. We argue that not all layers contribute equally to model adaptation. Leveraging this observation, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to, and readily compatible with, existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50% while maintaining predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop, or even improvements, in predictive performance on mathematical problem-solving and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe results competitive with fine-tuning LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
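To make the idea concrete, the core measurement can be sketched as follows. This is a minimal illustration, not the authors' implementation: it computes linear CKA between a layer's activations before and after a short adaptation pass, and the helper `select_layers`, the top-`k` criterion, and the assumption that layers with the *lowest* CKA (i.e., the largest representational change) are the ones worth equipping with LoRA are all simplifications for exposition.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (n_samples, dim) activations of the same inputs at one layer,
    e.g. before and after a brief adaptation pass. Returns a similarity
    in [0, 1]; 1 means the representations match up to rotation/scale.
    """
    X = X - X.mean(axis=0)          # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

def select_layers(acts_before, acts_after, k):
    """Hypothetical selection rule: rank layers by how much their
    representations changed (low CKA = large change) and keep the
    top-k as candidates for LoRA; all other layers stay frozen."""
    scores = [linear_cka(b, a) for b, a in zip(acts_before, acts_after)]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]
```

A layer whose activations are unchanged scores a CKA of 1.0 and is skipped, while a layer whose representation shifts substantially scores lower and is selected; the chosen indices would then be passed to a LoRA configuration that targets only those layers.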