Publications
- Published 'Agentic Misalignment: How LLMs Could Be Insider Threats' (2025)
- Contributed to 'Best-of-N Jailbreaking' (2024)
- Worked on 'Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs' (2024)
- Analyzed the generalization and reliability of steering vectors (2024)
- Co-authored 'Eight Methods to Evaluate Robust Unlearning in LLMs' (2024)
- Co-authored 'Towards Automated Circuit Discovery for Mechanistic Interpretability' (2023), a spotlight paper at NeurIPS 2023
- Developed Spawrious: A benchmark for fine control of spurious correlation biases (2023)
- Wrote a survey on causal machine learning and open problems (2022)
Research Experience
- PhD student at UCL focusing on AI Alignment
- Contract researcher with Anthropic
- Former MATS scholar with Stephen Casper
Education
PhD student at UCL, supervised by Stephen Casper (specific degree program, field, and dates not stated).
Background
Research interests include AI alignment, mechanistic interpretability, and AI safety. His work focuses on finding and fixing ways AI systems can fail, particularly on preventing them from engaging in harmful behaviors.
Miscellany
Currently based in San Francisco; involved with Entrepreneurs First