🤖 AI Summary
Problem: Existing watermarking schemes for LoRA adapters fail under black-box operations such as multi-LoRA stacking or sign inversion, because watermark detection is fragile to these weight-space transformations.
Method: This paper proposes the first black-box watermarking scheme robust to additive composition and sign flipping of LoRA adapters. It introduces (1) a complementary “Yin-Yang” dual-watermark mechanism embedded jointly in the LoRA weight space; (2) a shadow model-assisted distillation framework that explicitly models weight perturbations to enhance watermark stability; and (3) a dual-path verification architecture enabling reliable provenance tracing without access to the original model’s parameters.
Results: Extensive experiments on both large language models and diffusion models demonstrate near-perfect watermark detection accuracy (>99.5%), significantly outperforming prior methods. Our approach achieves, for the first time, strong robustness against adversarial multi-LoRA fusion and sign-inversion abuse—key challenges in practical LoRA deployment.
📝 Abstract
LoRA (Low-Rank Adaptation) has achieved remarkable success in the parameter-efficient fine-tuning of large models. A trained LoRA matrix can be integrated with the base model through addition or negation operations to improve performance on downstream tasks. However, the unauthorized use of LoRAs to generate harmful content highlights the need for effective mechanisms to trace their usage. A natural solution is to embed watermarks into LoRAs to detect unauthorized misuse. Existing methods struggle, however, when multiple LoRAs are combined or a negation operation is applied, as these operations can significantly degrade watermark performance. In this paper, we introduce LoRAGuard, a novel black-box watermarking technique for detecting unauthorized misuse of LoRAs. To support both addition and negation operations, we propose the Yin-Yang watermark technique, where the Yin watermark is verified under the negation operation and the Yang watermark under the addition operation. Additionally, we propose a shadow-model-based watermark training approach that significantly improves effectiveness in scenarios involving multiple integrated LoRAs. Extensive experiments on both language and diffusion models show that LoRAGuard achieves nearly 100% watermark verification success, demonstrating strong practical effectiveness.
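The addition and negation operations the abstract refers to can be made concrete with a small sketch. Assuming the standard LoRA update W' = W0 + B @ A, the snippet below shows multi-LoRA fusion and sign inversion in weight space; the function name `merge_loras` and the `sign` convention are illustrative, not from the paper.

```python
import numpy as np

def merge_loras(W0, adapters):
    """Compose a base weight matrix with several LoRA adapters.

    adapters: list of (B, A, sign) triples, where sign = +1 applies the
    adapter by addition and sign = -1 by negation -- the two black-box
    operations a robust LoRA watermark must survive.
    """
    W = W0.copy()
    for B, A, sign in adapters:
        W += sign * (B @ A)  # low-rank update, rank = B.shape[1]
    return W

rng = np.random.default_rng(0)
d, r = 8, 2  # model dimension and LoRA rank
W0 = rng.standard_normal((d, d))
B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, d))
B2, A2 = rng.standard_normal((d, r)), rng.standard_normal((r, d))

# Multi-LoRA fusion: both adapters added to the base weights.
W_add = merge_loras(W0, [(B1, A1, +1), (B2, A2, +1)])
# Negation: the first adapter is subtracted instead.
W_neg = merge_loras(W0, [(B1, A1, -1)])
```

Both composed models behave as ordinary dense weights afterward, which is why a black-box verifier (querying outputs only, without access to W0, B, or A) is needed for provenance tracing.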