Daniel Tan

Google Scholar ID: QKO1QacAAAAJ

UCL

Alignment, ML, Robotics

Citations & Impact

All-time citations: 165

Publications

8 items

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

2026

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

2025

LISA Technical Report: An Agentic Framework for Smart Contract Auditing

2025

Emergent misalignment as prompt sensitivity: A research note

2025

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

2025

Analyzing the Generalization and Reliability of Steering Vectors

arXiv.org · 2024

Resume

Academic Achievements

Has contributed substantially to several papers, including 'Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time', 'Easily steer OOD generalisation by adding one line to training data', 'Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs', 'Models finetuned to write insecure code learn to admire Nazis', 'Analyzing the Generalization and Reliability of Steering Vectors' (accepted at NeurIPS 2024), and 'Towards Generalist Robot Learning from Internet Video: A Survey' (JAIR).

Research Experience

Posts frequent updates on LessWrong and Twitter.

Education

Currently completing MATS 7.0 with Owain Evans. Also a PhD student at University College London, supervised by Paige Brooks and supported by the Agency for Science, Technology and Research (A*STAR).

Background

Has a broad interest in AI alignment and AGI risk. Current focus is on understanding and evaluating the legibility of models' chain-of-thought reasoning. Also interested in steganography, prosaic interpretability, and alignment failure modes.
