🤖 AI Summary
This study investigates the legal risks and performance impacts of training large language models (LLMs) on copyrighted material, focusing on Norwegian-language models. To address this, the authors develop a reproducible evaluation framework, the first to systematically quantify the differential impact of three categories of copyright-protected text (books, newspapers, and fiction) on model performance. The methodology integrates multi-task benchmarking, controlled-variable training experiments, copyright-text provenance tracing, and contribution-attribution analysis. Results indicate that books and newspapers significantly improve performance across multiple Norwegian benchmarks, whereas fiction degrades performance in several cases. The authors propose a data-impact assessment paradigm that jointly considers legal compliance and modeling efficacy, offering empirical grounding and methodological support for copyright-compliant training-data auditing and equitable author compensation mechanisms.
📝 Abstract
The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian, along with the results of applying it. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works may lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.