Interrogating LLM design under a fair learning doctrine

📅 2025-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current copyright disputes surrounding large language models (LLMs) predominantly hinge on the "substantial similarity" of outputs, a legally ambiguous and algorithmically intractable standard that overlooks risks intrinsic to the training process itself. Method: This paper shifts the focus to model architecture and training mechanisms, proposing a legally grounded "fair learning" standard. It operationalizes copyright law's notion of a "substantial effect" as a quantifiable training-time causal effect, combining causal inference (do-calculus, counterfactual analysis), memory probing techniques (retrieval-based probing, the logit lens), and reverse engineering of the Pythia model suite into an interdisciplinary assessment framework that bridges law and machine learning. Contribution/Results: Empirical analysis reveals statistically significant causal impacts of key training decisions, including data deduplication and curriculum learning, on model memorization. This work constitutes the first technically rigorous operationalization of copyright's legal elements, enabling traceable, evidence-based judicial attribution and establishing a scalable methodology for regulatory and litigation contexts.
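The logit lens mentioned in the summary can be illustrated concretely. The sketch below assumes the standard Hugging Face interface for Pythia's GPT-NeoX architecture rather than the paper's own probing code: it decodes each intermediate layer's hidden state through the model's final layer norm and unembedding matrix, showing at which depth a continuation becomes predictable.

```python
# Minimal logit-lens sketch on a Pythia checkpoint. Illustrative only; the
# paper's exact probing procedure may differ.
import torch
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model_name = "EleutherAI/pythia-1.4b"  # any Pythia checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoXForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The quick brown fox jumps over the lazy"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# Project every intermediate hidden state through the final layer norm and
# the unembedding to see which token each layer "believes" comes next.
for layer, hidden in enumerate(out.hidden_states):
    normed = model.gpt_neox.final_layer_norm(hidden[:, -1])
    logits = model.embed_out(normed)
    top = logits.argmax(dim=-1)
    print(f"layer {layer:2d}: {tok.decode(top)!r}")
```

If a memorized continuation already dominates in early layers, that is evidence about where and how strongly a sequence is stored, which is the kind of factual determination the framework aims to surface.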

📝 Abstract
The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically, and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.
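The Pythia suite is well suited to the counterfactual comparisons the abstract describes, since EleutherAI released checkpoints trained on both the Pile and a deduplicated Pile with other training decisions held fixed, isolating deduplication as the intervention. A minimal sketch follows, assuming a Carlini-style greedy extraction test as the memorization measure; the paper's own metrics may differ, and `pile_excerpts` is a hypothetical placeholder for verbatim training-set passages.

```python
# Sketch of a counterfactual memorization comparison across Pythia variants.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def extraction_rate(model_name, excerpts, prefix_len=32, suffix_len=32):
    """Fraction of excerpts whose next `suffix_len` tokens the model
    reproduces exactly under greedy decoding from a `prefix_len` prompt."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    hits, total = 0, 0
    for text in excerpts:
        ids = tok(text, return_tensors="pt").input_ids[0]
        if ids.numel() < prefix_len + suffix_len:
            continue  # skip excerpts too short to split into prefix/suffix
        total += 1
        prefix = ids[:prefix_len].unsqueeze(0)
        target = ids[prefix_len:prefix_len + suffix_len]
        with torch.no_grad():
            gen = model.generate(prefix, max_new_tokens=suffix_len,
                                 do_sample=False,
                                 pad_token_id=tok.eos_token_id)
        if torch.equal(gen[0, prefix_len:], target):
            hits += 1
    return hits / max(total, 1)

pile_excerpts = [  # hypothetical: fill with passages known to be in the Pile
    "...",
]
rate = extraction_rate("EleutherAI/pythia-1.4b", pile_excerpts)
rate_dedup = extraction_rate("EleutherAI/pythia-1.4b-deduped", pile_excerpts)
print(f"extraction rate: pile={rate:.3f}  deduped={rate_dedup:.3f}")
```

Because everything except the training data is matched between the two checkpoints, a gap in extraction rates can be read as the causal effect of the deduplication decision on memorization.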
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM training under fair learning
Evaluating copyright risks in LLM design
Proposing legal standards for model memorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structural perspective on LLM training
Causal and correlational analyses (see the statistical sketch after this list)
Legal standard for fair learning
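As a rough illustration of how "substantially affected" could be given statistical teeth, the sketch below applies a two-proportion z-test to extraction counts from the two Pythia variants. The counts are placeholders, not results from the paper, and the choice of test is an assumption.

```python
# Hypothetical significance test on per-checkpoint extraction counts.
from statsmodels.stats.proportion import proportions_ztest

n_samples = 10_000   # probes run against each checkpoint (placeholder)
hits_pile = 412      # hypothetical extractions, standard Pile model
hits_dedup = 168     # hypothetical extractions, deduplicated model

z, p = proportions_ztest(count=[hits_pile, hits_dedup],
                         nobs=[n_samples, n_samples])
print(f"z = {z:.2f}, p = {p:.2e}")
# A small p-value would support a factual finding that the deduplication
# decision substantially affected memorization, in the sense the paper
# operationalizes.
```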
Johnny Tian-Zheng Wei
University of Southern California, USA
Natural language processing
Maggie Wang
Princeton University, USA
Ameya Godbole
University of Southern California, USA
Jonathan H. Choi
University of Southern California, USA
Robin Jia
University of Southern California, USA
Natural language processing