Publications
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
A StrongREJECT for Empty Jailbreaks
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
ALMANACS: A Simulatability Benchmark for Language Model Explainability
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Research Experience
Interested in both the theory and practice of AI alignment. Helped characterize how RLHF can lead to deception when the AI observes more of the environment than the human evaluator, develop multimodal attacks and benchmarks for open-ended agents, and use mechanistic interpretability to find evidence of learned look-ahead in a chess-playing neural network.
Education
PhD, University of California, Berkeley, Center for Human-Compatible AI, Advisor: Stuart Russell.
Background
A research scientist at Google DeepMind focused on AI safety and alignment. Completed his PhD at UC Berkeley’s Center for Human-Compatible AI, advised by Stuart Russell. Previously co-founded far.ai, a 501(c)(3) research nonprofit that incubates and accelerates beneficial AI research agendas.