🤖 AI Summary
Current copyright disputes surrounding large language models (LLMs) predominantly hinge on “substantial similarity” of outputs—a legally ambiguous and algorithmically intractable standard that overlooks intrinsic risks arising from the training process itself.
Method: This paper shifts focus to model architecture and training mechanisms, proposing a legally grounded “fair learning” standard. It operationalizes copyright law’s “substantial effect” as a quantifiable training-time causal effect, integrating causal inference (do-calculus, counterfactual analysis), memory probing techniques (retrieval-based probing, logit lens), and reverse engineering of the open-source Pythia model to build an interdisciplinary assessment framework bridging law and machine learning.
Contribution/Results: Empirical analysis reveals statistically significant causal impacts of key training decisions—including data deduplication and curriculum learning—on model memorization. This work constitutes the first technically rigorous operationalization of copyright legal elements, enabling traceable, evidence-based judicial attribution and establishing a scalable methodology for regulatory and litigation contexts.
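The retrieval-based memorization probing mentioned above can be pictured as a prefix-continuation extraction test: prompt the model with a prefix drawn from a training document and check whether it reproduces the true continuation verbatim. A minimal sketch, where the `continue_fn` model interface and the prefix/suffix lengths are illustrative assumptions rather than the paper's exact protocol:

```python
def memorization_rate(samples, continue_fn, prefix_len=32, suffix_len=16):
    """Fraction of training samples whose suffix the model reproduces
    verbatim when prompted with the preceding prefix.

    `continue_fn(prefix, n)` stands in for greedy decoding of `n`
    characters/tokens from the model under audit (hypothetical interface).
    """
    hits = 0
    for text in samples:
        prefix = text[:prefix_len]
        suffix = text[prefix_len:prefix_len + suffix_len]
        if continue_fn(prefix, suffix_len) == suffix:
            hits += 1
    return hits / len(samples)
```

In practice the same rate would be computed across Pythia checkpoints or data subsets, so that differences in extraction behavior can be attributed to specific training decisions.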
📝 Abstract
The current discourse on large language models (LLMs) and copyright largely takes a “behavioral” perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically, and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary “structural” perspective and shift our focus to how LLMs are trained. We operationalize a notion of “fair learning” by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.
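One way to picture "whether a training decision substantially affected the model's memorization" is as a comparison of extraction rates between a model trained with the intervention (e.g., deduplicated data) and a control model. The sketch below uses a standard two-proportion z-statistic for that comparison; the specific test, and the framing of counts as extraction hits, are illustrative assumptions rather than the paper's method:

```python
import math

def memorization_effect(hits_treated, n_treated, hits_control, n_control):
    """Difference in memorization (extraction) rates between a model
    trained under an intervention and a control model, together with a
    two-proportion z-statistic for that difference.

    Inputs are counts of probed sequences extracted verbatim (`hits_*`)
    out of the total probed (`n_*`) for each model variant.
    """
    p_t = hits_treated / n_treated
    p_c = hits_control / n_control
    pooled = (hits_treated + hits_control) / (n_treated + n_control)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    return p_t - p_c, (p_t - p_c) / se
```

A large-magnitude statistic would support a factual finding that the decision had a statistically significant effect on memorization, which is the kind of traceable, evidence-based determination the summary describes.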