Protecting Private Code in IDE Autocomplete using Differential Privacy

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the privacy risks associated with large language model (LLM)-based code completion in integrated development environments (IDEs), which can inadvertently leak users’ private code and are vulnerable to membership inference attacks. The study presents the first systematic integration of differential privacy (DP) into the fine-tuning of an LLM for Kotlin code completion within an IDE setting. Experimental results demonstrate that the proposed DP-enhanced approach maintains utility comparable to non-private models—even when trained on data reduced by two orders of magnitude—while significantly mitigating privacy leakage: the area under the curve (AUC) of membership inference attacks drops from 0.901 to 0.606, approaching random guessing. This achieves strong privacy guarantees with minimal degradation in model performance.
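The summary refers to integrating differential privacy into fine-tuning, which in practice means DP-SGD: clipping each example's gradient and adding calibrated Gaussian noise before the update. The NumPy sketch below illustrates only these two ingredients under assumed values for the clip norm and noise multiplier; it is a simplified illustration, not the paper's training setup, and it omits privacy accounting entirely.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One differentially private gradient step (DP-SGD sketch).

    per_example_grads: array of shape (batch, dim), one gradient per example.
    Each gradient is clipped to L2 norm <= clip_norm, then Gaussian noise
    calibrated to that clipping bound is added to the sum before averaging.
    """
    rng = rng or np.random.default_rng(0)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose norm exceeds the clip bound.
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    # Isotropic Gaussian noise; its scale is tied to the sensitivity (clip_norm).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)
```

In real fine-tuning, libraries such as Opacus wrap this logic around a PyTorch optimizer and track the cumulative privacy budget (epsilon, delta) across all steps.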

📝 Abstract
Modern Integrated Development Environments (IDEs) increasingly leverage Large Language Models (LLMs) to provide advanced features like code autocomplete. While powerful, training these models on user-written code introduces significant privacy risks, making the models themselves a new type of data vulnerability. Malicious actors can exploit this by launching attacks to reconstruct sensitive training data or infer whether a specific code snippet was used for training. This paper investigates the use of Differential Privacy (DP) as a robust defense mechanism for training an LLM for Kotlin code completion. We fine-tune a Mellum model using DP and conduct a comprehensive evaluation of its privacy and utility. Our results demonstrate that DP provides a strong defense against Membership Inference Attacks (MIAs), reducing the attack's success rate to near random guessing (AUC from 0.901 to 0.606). Furthermore, we show that this privacy guarantee comes at minimal cost to model performance, with the DP-trained model achieving utility scores comparable to its non-private counterpart, even when trained on 100x less data. Our findings suggest that DP is a practical and effective solution for building private and trustworthy AI-powered IDE features.
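The AUC figures quoted above (0.901 dropping to 0.606) summarize how well a membership inference attack separates training members from non-members by some per-example score (e.g., negative loss). The sketch below shows one standard way to compute that AUC, via the rank (Mann-Whitney) statistic; the scoring convention, where higher means "predicted member", is an illustrative assumption, not the paper's exact attack.

```python
def mia_auc(member_scores, nonmember_scores):
    """AUC of a membership inference attack via the Mann-Whitney statistic.

    Returns the fraction of (member, non-member) pairs in which the member
    receives the higher attack score, counting ties as half a win.
    An AUC of 0.5 means the attack is no better than random guessing.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

An AUC close to 0.5, like the 0.606 reported for the DP-trained model, means the attacker can barely distinguish examples that were in the training set from those that were not.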
Problem

Research questions and friction points this paper is trying to address.

Differential Privacy
Code Autocomplete
Privacy Risk
Membership Inference Attacks
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential Privacy
Code Autocomplete
Membership Inference Attack
Large Language Models
Privacy-Preserving Machine Learning
Evgeny Grigorenko
JetBrains Research, Belgrade, Serbia
David Stanojević
JetBrains Research, Belgrade, Serbia
David Ilić
JetBrains Research, Belgrade, Serbia
Egor Bogomolov
JetBrains Research
machine learning for software engineering
Kostadin Cvejoski
JetBrains
LLMs, Deep Learning, Point Processes, Dynamic Language Models