"Human Control: Definitions and Algorithms" (UAI 2023): Studies definitions of human control (e.g., corrigibility, alignment), their guarantees for human autonomy, and associated algorithms.
"Reasoning about Causality in Games" (Artificial Intelligence Journal 2023): Introduces structural causal games as a unified framework for causal and game-theoretic reasoning.
"Path-Specific Objectives for Safer Agent Incentives" (AAAI 2022): Addresses how to optimize objectives without undesirable means (e.g., user manipulation).
"A Complete Criterion for Value of Information in Soluble Influence Diagrams" (AAAI 2022): Provides a complete graphical criterion for value of information in multi-decision influence diagrams.
"Why Fair Labels Can Yield Unfair Predictions" (AAAI 2022): Shows how unfairness can be incentivized even with perfectly fair labels, with graphical conditions.
"Agent Incentives: A Causal Perspective" (AAAI 2021): Presents sound and complete graphical criteria for four types of agent incentives.
"Incorrigibility in the CIRL Framework" (AIES 2018): Analyzes how Cooperative Inverse Reinforcement Learning may fail to prevent incorrigible behavior.
Research Experience
Research Fellow at the Future of Humanity Institute, University of Oxford.
Research Intern at DeepMind.
Research Intern at OpenAI.
Founder of the EA Forum (Effective Altruism Forum).
Co-founder of the Causal Incentives Working Group, which applies causal models to AI safety.