🤖 AI Summary
This work addresses the privacy risks of large language model (LLM)-based code completion in integrated development environments (IDEs): models fine-tuned on user-written code can inadvertently leak private code and are vulnerable to membership inference attacks. The study presents the first systematic integration of differential privacy (DP) into the fine-tuning of an LLM for Kotlin code completion in an IDE setting. Experimental results show that the DP-trained model retains utility comparable to non-private models, even when trained on two orders of magnitude less data, while substantially reducing privacy leakage: the area under the curve (AUC) of membership inference attacks drops from 0.901 to 0.606, approaching random guessing. The approach thus delivers strong privacy guarantees at minimal cost to model performance.
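For context on the reported metric: a membership inference attack assigns each sample a score, and its AUC is the probability that a true training member outscores a non-member, so 0.5 means random guessing and 1.0 means perfect inference. The paper's evaluation code is not shown here; the sketch below (with purely illustrative scores) just demonstrates how the metric is computed:

```python
def mia_auc(member_scores, nonmember_scores):
    """AUC of a membership inference attack: the probability that a
    randomly chosen member's attack score ranks above a randomly
    chosen non-member's (ties count as half)."""
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    total = len(member_scores) * len(nonmember_scores)
    return (wins + 0.5 * ties) / total

# Illustrative only -- not the paper's data:
# a well-separated attack approaches AUC 1.0 ...
print(mia_auc([0.9, 0.8, 0.7], [0.2, 0.1, 0.3]))  # → 1.0
# ... while overlapping score distributions push AUC toward 0.5
print(mia_auc([0.6, 0.4], [0.5, 0.5]))            # → 0.5
```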
📝 Abstract
Modern Integrated Development Environments (IDEs) increasingly leverage Large Language Models (LLMs) to provide advanced features like code autocomplete. While powerful, training these models on user-written code introduces significant privacy risks, making the models themselves a new type of data vulnerability. Malicious actors can exploit this by launching attacks to reconstruct sensitive training data or infer whether a specific code snippet was used for training. This paper investigates the use of Differential Privacy (DP) as a robust defense mechanism for training an LLM for Kotlin code completion. We fine-tune a `Mellum` model using DP and conduct a comprehensive evaluation of its privacy and utility. Our results demonstrate that DP provides a strong defense against Membership Inference Attacks (MIAs), reducing the attack's success rate close to a random guess (AUC from 0.901 to 0.606). Furthermore, we show that this privacy guarantee comes at a minimal cost to model performance, with the DP-trained model achieving utility scores comparable to its non-private counterpart, even when trained on 100x less data. Our findings suggest that DP is a practical and effective solution for building private and trustworthy AI-powered IDE features.
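The paper's training code is not reproduced here, but DP fine-tuning of this kind typically rests on the DP-SGD update (Abadi et al.): clip each per-example gradient to a norm bound, average, and add calibrated Gaussian noise. A minimal NumPy sketch of that aggregation step, with illustrative parameter names and constants:

```python
import numpy as np

def dp_sgd_aggregate(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD gradient aggregation step (illustrative sketch):
    1) clip each per-example gradient to L2 norm <= clip_norm,
    2) average the clipped gradients over the batch,
    3) add Gaussian noise scaled by noise_multiplier * clip_norm / batch_size.
    """
    batch_size = len(per_example_grads)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down (never up) so every example's influence is bounded.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(
        loc=0.0,
        scale=noise_multiplier * clip_norm / batch_size,
        size=mean_grad.shape,
    )
    return mean_grad + noise

# Toy usage: two gradients of norm 5, clipped to 1, no noise for clarity.
rng = np.random.default_rng(0)
grads = [np.array([3.0, 4.0]), np.array([3.0, 4.0])]
g = dp_sgd_aggregate(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
print(np.linalg.norm(g))  # → 1.0 (bounded per-example influence)
```

In practice, libraries such as Opacus wrap a PyTorch model and optimizer to perform this per-sample clipping and noising automatically while tracking the (ε, δ) privacy budget; the exact hyperparameters used in the paper are not stated in this excerpt.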