🤖 AI Summary
This study addresses the validity challenges that arise when evaluating human-AI collaboration in high-stakes decision-making, where conventional causal inference assumptions, particularly those underlying randomized controlled trials (RCTs), often misalign with the dynamic nature of frontier AI systems, compromising internal, external, and construct validity. Through in-depth interviews with 16 domain experts spanning biosecurity, cybersecurity, education, and labor, the research uses qualitative analysis and methodological mapping to systematically identify and structure the key validity threats that arise when RCTs are applied to frontier AI evaluation. The work further proposes practical mitigation strategies aligned with distinct phases of the AI development lifecycle, delineates the boundaries within which human uplift evidence remains valid, and offers actionable methodological guidance for AI governance, deployment, and safety assessment.
📝 Abstract
Human uplift studies (studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology) are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying these studies are well established, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results inform high-stakes decisions. We present findings from interviews with 16 expert practitioners experienced in conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain the assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We synthesize these challenges across key stages of the human uplift research lifecycle and map them to practitioner-reported solutions, clarifying both the limits and the appropriate uses of evidence from human uplift studies in high-stakes decision-making.