🤖 AI Summary
This paper addresses the critical issue of large language models (LLMs) omitting open-source license information during code generation, which introduces intellectual property risks. To this end, we introduce LiCoEval, the first benchmark explicitly designed to evaluate LLMs' compliance with open-source licensing requirements. Methodologically, we first conduct an empirical study to define a standard of "striking similarity" between generated and existing code, using measures such as token-level (Jaccard) similarity and AST-based clone detection, and then integrate license metadata parsing, manual verification, and statistical analysis to systematically assess how accurately LLMs attribute licenses, particularly copyleft licenses, when generating strikingly similar code. Our key contributions are: (1) the first LLM evaluation framework dedicated to open-source license compliance; and (2) an empirical finding that, across 14 state-of-the-art models, only 0.88%–2.01% of generated code is strikingly similar to existing licensed code, yet over 90% of those cases fail to attribute the corresponding copyleft license correctly, revealing a severe and widespread compliance gap.
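As an illustration of one of the similarity measures mentioned above, the following is a minimal sketch of token-level Jaccard similarity between two code snippets. This is not the paper's exact procedure: the tokenizer, and any decision threshold, are assumptions made purely for illustration.

```python
import re

def tokenize(code: str) -> set[str]:
    """Split source code into a set of identifier, number, and operator tokens.
    A deliberately simple regex tokenizer; real clone detectors use
    language-aware lexers."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity = |tokens(a) & tokens(b)| / |tokens(a) | tokens(b)|."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

snippet = "def add(a, b):\n    return a + b"
# Identical snippets score 1.0; unrelated snippets score near 0.
print(jaccard_similarity(snippet, snippet))  # → 1.0
```

In practice such a token-set score would be only a coarse first-pass filter; AST-based clone detection is needed to catch copies with renamed identifiers, which a token-set measure misses.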
📝 Abstract
Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, creating potential intellectual property violations during software production. This paper addresses the critical yet underexplored issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for the code they generate. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation and thus indicates a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose LiCoEval, a benchmark for evaluating the license compliance capabilities of LLMs, i.e., their ability to provide accurate license or copyright information when they generate code strikingly similar to existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs and find that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development on license compliance in AI-assisted software development, contributing both to the protection of open-source software copyrights and to the mitigation of legal risks for LLM users.