Enhancing Learned Knowledge in LoRA Adapters Through Efficient Contrastive Decoding on Ascend NPUs

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
LoRA-finetuned models often suffer from base-model bias in complex reasoning tasks, leading to ineffective activation of task-specific knowledge. To address this, we propose Contrastive LoRA Decoding (CoLD), the first framework to integrate contrastive decoding into the LoRA inference phase. CoLD quantifies token-level distribution discrepancies between the expert (LoRA-adapted) and base models using KL or JS divergence, dynamically suppressing generic outputs while enhancing task-specific generation. Furthermore, we design a lightweight dual-model parallel scoring kernel optimized for Ascend NPUs, enabling efficient inference with zero additional device memory overhead. Experiments demonstrate that CoLD achieves up to a 5.54% absolute accuracy gain across multiple downstream tasks and reduces end-to-end latency by 28% compared to greedy decoding, significantly outperforming conventional methods. The framework has been deployed in Huawei Cloud's production environment.
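
As a concrete illustration, here is a minimal PyTorch sketch of the contrastive token-scoring idea the summary describes. The function name, the `alpha` plausibility cutoff, and the `beta` contrast weight are illustrative assumptions, not the paper's exact KL/JS-based scoring rule:

```python
import torch
import torch.nn.functional as F

def contrastive_scores(expert_logits: torch.Tensor,
                       base_logits: torch.Tensor,
                       alpha: float = 0.1,
                       beta: float = 1.0) -> torch.Tensor:
    """Score next-token candidates by contrasting the LoRA expert
    against the base model. `alpha` and `beta` are illustrative
    hyperparameters, not values taken from the paper."""
    log_p_expert = F.log_softmax(expert_logits, dim=-1)
    log_p_base = F.log_softmax(base_logits, dim=-1)

    # Plausibility mask (standard in contrastive decoding): keep only
    # tokens the expert itself assigns reasonable probability to, so the
    # contrast term cannot promote tokens the base model merely dislikes.
    cutoff = (torch.log(torch.tensor(alpha))
              + log_p_expert.max(dim=-1, keepdim=True).values)
    mask = log_p_expert >= cutoff

    # Reward tokens where the expert's distribution diverges upward
    # from the base model's, i.e. tokens the adapter specifically learned.
    scores = log_p_expert - beta * log_p_base
    return scores.masked_fill(~mask, float("-inf"))
```

The plausibility mask is the usual safeguard in contrastive decoding: without it, the contrast term can favor low-probability tokens simply because the base model dislikes them.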

📝 Abstract
Huawei Cloud users leverage LoRA (Low-Rank Adaptation) as an efficient and scalable method to fine-tune and customize large language models (LLMs) for application-specific needs. However, tasks that require complex reasoning or deep contextual understanding are often hindered by biases or interference from the base model when using typical decoding methods like greedy or beam search. These biases can lead to generic or task-agnostic responses from the base model instead of leveraging the LoRA-specific adaptations. In this paper, we introduce Contrastive LoRA Decoding (CoLD), a novel decoding framework designed to maximize the use of task-specific knowledge in LoRA-adapted models, resulting in better downstream performance. CoLD uses contrastive decoding by scoring candidate tokens based on the divergence between the probability distributions of a LoRA-adapted expert model and the corresponding base model. This approach prioritizes tokens that better align with the LoRA adapter's learned representations, enhancing performance for specialized tasks. While effective, a naive implementation of CoLD is computationally expensive because each decoding step requires evaluating multiple token candidates across both models. To address this, we developed an optimized kernel for Huawei's Ascend NPU. CoLD achieves up to a 5.54% increase in task accuracy while reducing end-to-end latency by 28% compared to greedy decoding. This work provides practical and efficient decoding strategies for fine-tuned LLMs in resource-constrained environments and has broad implications for applied data science in both cloud and on-premises settings.
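
The "naive implementation" the abstract refers to looks roughly like the loop below: every decoding step pays for a forward pass through both the expert and the base model, which is the cost the optimized Ascend kernel removes. This is a schematic sketch assuming Hugging Face-style models that return `.logits`; it omits KV caching and stopping criteria for brevity and reuses the hypothetical `contrastive_scores` helper sketched above:

```python
import torch

@torch.no_grad()
def cold_greedy_decode(expert_model, base_model, input_ids,
                       max_new_tokens: int = 64) -> torch.Tensor:
    """Naive two-pass CoLD loop. Each step runs a full forward pass
    through BOTH models; model handling is schematic, not the paper's API."""
    for _ in range(max_new_tokens):
        # Two forward passes per step: this duplication is exactly what
        # the paper's dual-model parallel scoring kernel is built to avoid.
        expert_logits = expert_model(input_ids).logits[:, -1, :]
        base_logits = base_model(input_ids).logits[:, -1, :]
        scores = contrastive_scores(expert_logits, base_logits)
        next_token = scores.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```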
Problem

Research questions and friction points this paper is trying to address.

Reducing base model bias in LoRA-adapted LLMs
Improving task-specific knowledge utilization in LoRA
Optimizing contrastive decoding efficiency for NPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive LoRA Decoding (CoLD) enhances task-specific knowledge
CoLD uses divergence scoring between expert and base models
Optimized Ascend NPU kernel reduces latency by 28%
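
One plausible reading of the zero-extra-memory claim: because a LoRA expert is the base model plus a low-rank delta, both probability distributions can be served from a single copy of the weights. The sketch below illustrates this with PEFT's `disable_adapter` context manager; the paper's Ascend kernel presumably goes further by scoring both models in parallel rather than in two sequential passes, but the memory argument is the same:

```python
import torch

@torch.no_grad()
def dual_scores_shared_weights(peft_model, input_ids):
    """Compute expert and base next-token logits from ONE set of weights
    by toggling the LoRA adapter. A sketch of the weight-sharing idea,
    not the paper's fused Ascend NPU kernel."""
    # Adapter enabled: expert (LoRA-adapted) distribution.
    expert_logits = peft_model(input_ids).logits[:, -1, :]
    # Adapter disabled: the underlying base model's distribution,
    # with no second copy of the weights in memory.
    with peft_model.disable_adapter():
        base_logits = peft_model(input_ids).logits[:, -1, :]
    return expert_logits, base_logits
```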