Research Experience
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Conditioning Predictive Models
An overview of 11 proposals for building safe advanced AI
Risks from Learned Optimization
Previously: MIRI, OpenAI

Background
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Miscellany
Personal interests: engaging in discussions and encouraging people to apply to the Anthropic Fellows program, a safety-focused mentorship program.