Has made substantial contributions to several papers, including 'Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time', 'Easily steer OOD generalisation by adding one line to training data', 'Emergent Misalignment: Narrow Finetuning can lead to Broad Misalignment', 'Models finetuned to write insecure code learn to admire Nazis', 'Analyzing the Generalization and Reliability of Steering Vectors' (accepted at NeurIPS 2024), and 'Towards Generalist Robot Learning from Internet Video: A Survey' (in proceedings, JAIR).
Research Experience
Posts frequent updates on LessWrong and Twitter.
Education
Currently completing MATS 7.0 with Owain Evans. Also a PhD student at University College London, supervised by Paige Brooks and supported by the Agency for Science, Technology and Research (A*STAR).
Background
Has a broad interest in AI alignment and AGI risk. Current focus is on understanding and evaluating the legibility of models' chain-of-thought reasoning. Also interested in steganography, prosaic interpretability, and alignment failure modes.
Miscellany
Outside of research, enjoys sharing updates on LessWrong and Twitter.