On the Possibility of Breaking Copyleft Licenses When Reusing Code Generated by ChatGPT

📅 2025-02-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines the “copyleft contagion” risk: large language models (LLMs) may generate code derived from copyleft-licensed sources (e.g., GPL), leading developers to violate licensing obligations inadvertently. The study runs more than 70,000 systematic code-generation experiments, combined with abstract syntax tree (AST) and token-level similarity detection, license provenance analysis, and prompt engineering, to give a large-scale empirical assessment of this risk. Results show that a larger context raises the reproduction rate of copyleft-licensed code (up to 12.7%), whereas a sampling temperature of 0.8 or higher suppresses it to below 1.3%. The study therefore identifies temperature as a controllable, compliance-relevant hyperparameter, which provides both empirical evidence and a practical intervention for open-source license compliance in AI-augmented software development.
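To make "token-level similarity detection" concrete, here is a minimal sketch of one common way to compare a generated method against a reference snippet: Jaccard similarity over token n-grams. This is an illustration only. The tokenizer, the n-gram size, and the function names (`tokenize`, `shingles`, `token_similarity`) are assumptions, and the summary does not say which measure the study actually uses.

```python
import re
from typing import List, Set, Tuple

# Hypothetical, language-agnostic tokenizer: identifiers, integers,
# and any other single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> List[str]:
    """Split source code into tokens, ignoring whitespace and layout."""
    return TOKEN_RE.findall(code)


def shingles(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous token n-grams."""
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def token_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of token n-gram sets, in the range [0, 1]."""
    sa, sb = shingles(tokenize(a), n), shingles(tokenize(b), n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    generated = "int add(int a,int b){return a+b;}"
    reference = "int add(int a, int b) {\n    return a + b;\n}"
    # Formatting differences alone do not lower the score.
    print(token_similarity(generated, reference))
```

A score near 1 flags a generated method as a likely reproduction of the reference. In practice a threshold would have to be calibrated, and a token-level measure like this one does not detect renamed identifiers, which is where an AST-based comparison would help.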

📝 Abstract

AI assistants can help developers by recommending code to be included in their implementations (e.g., suggesting the implementation of a method from its signature). Although useful, these recommendations may mirror copyleft code available in public repositories, exposing developers to the risk of reusing code that they are allowed to reuse only under certain constraints (e.g., a specific license for the derivative software). This paper presents a large-scale study about the frequency and magnitude of this phenomenon in ChatGPT. In particular, we generate more than 70,000 method implementations using a range of configurations and prompts, revealing that a larger context increases the likelihood of reproducing copyleft code, but higher temperature settings can mitigate this issue.
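One intuition for why a higher temperature could reduce verbatim reproduction is that temperature rescales the model's token scores before sampling, so the most likely (possibly memorized) continuation is chosen less often. The sketch below shows only that general mechanism. The logit values are hypothetical, and the abstract reports the effect without attributing it to this or any other mechanism.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical scores: the first token stands for a memorized continuation.
    logits = [4.0, 2.0, 1.0]
    for t in (0.2, 0.8, 1.5):
        top = softmax_with_temperature(logits, t)[0]
        print(f"temperature={t}: P(top token)={top:.3f}")
```

As the temperature rises, the probability of the top token falls and the alternatives gain mass, so a long exact copy, which requires picking the top token many times in a row, becomes less likely.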

Problem

Research questions and friction points this paper is trying to address.

Investigates copyleft license violations in AI-generated code.

Analyzes ChatGPT's tendency to reproduce restricted public code.

Explores settings to reduce copyleft code replication in AI.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates more than 70,000 method implementations with ChatGPT across a range of configurations and prompts

Large-scale study on copyleft code reuse

Higher temperature settings mitigate copyleft code reproduction

🔎 Similar Papers

LiCoEval: Evaluating LLMs on License Compliance in Code Generation