🤖 AI Summary
This report examines Training Data Attribution (TDA) as a lever for reducing extreme risks from large language models (LLMs). It assesses how feasible it is to scale existing TDA methods, such as influence functions, into efficient and accurate inference tools for frontier-scale models; the research benefits that would give AI labs an incentive to adopt such tooling; and the key deployment bottleneck of labs' willingness to disclose their training data, including possible workarounds and the prospect of government-mandated access. Assuming labs do provide access to TDA inference, the report surveys the societal benefits, policies, and systems that TDA could enable, and closes with an evaluation of TDA's potential impact on mitigating large-scale risks from AI systems.
📝 Abstract
This report investigates Training Data Attribution (TDA): its potential importance for reducing extreme risks from AI, and how tractable it would be to develop and deploy. First, we discuss how plausible it is, and how much effort it would take, to bring existing TDA research from its current state to an efficient and accurate TDA inference tool that can be run on frontier-scale LLMs. Next, we discuss the research benefits AI labs could expect from using such TDA tooling. Then, we discuss a key outstanding bottleneck that would limit public access to such tooling: AI labs' willingness to disclose their training data. We suggest ways AI labs might work around these limitations, and discuss governments' willingness to mandate such access. Assuming that AI labs willingly provide access to TDA inference, we then discuss the high-level societal benefits that might follow. We list and discuss a series of policies and systems that TDA may enable. Finally, we present an evaluation of TDA's potential impact on mitigating large-scale risks from AI systems.
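For readers unfamiliar with what TDA inference involves, the sketch below illustrates the simplest gradient-based form of influence estimation on a toy linear model. Everything in it (the model, data, and loss) is hypothetical and chosen only for illustration; it uses a first-order, identity-Hessian approximation, whereas running TDA on frontier-scale LLMs would require curvature approximations and systems engineering far beyond this sketch.

```python
import torch

torch.manual_seed(0)

# Toy setup: a linear model standing in for an LLM. All names and data here
# are hypothetical and exist only to illustrate the attribution idea.
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
train_x, train_y = torch.randn(8, 4), torch.randn(8, 1)
query_x, query_y = torch.randn(1, 4), torch.randn(1, 1)

def loss_grad(x, y):
    """Return the flattened parameter gradient of the loss at (x, y)."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# First-order influence score: the dot product between the query example's
# gradient and each training example's gradient. This replaces the inverse
# Hessian in the classical influence-function formula with the identity.
query_grad = loss_grad(query_x, query_y)
scores = torch.stack([
    torch.dot(loss_grad(train_x[i : i + 1], train_y[i : i + 1]), query_grad)
    for i in range(len(train_x))
])

# Training examples ranked from most to least influential on the query.
print(scores.argsort(descending=True))
```

In practice, an efficient TDA inference tool would precompute and compress per-example gradient information offline so that attribution queries over billions of training documents become cheap; the ranking step shown at the end is the operation such a tool would expose.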