Published 'Subliminal Learning: LLMs transmit behavioral traits via hidden signals in data' – showing that a student model can acquire a teacher model's behavioral traits from training data that is semantically unrelated to those traits
Published 'Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs' – demonstrating that finetuning on a narrow task, such as writing insecure code, can make a model broadly misaligned across unrelated prompts
Developed the Situational Awareness Dataset (SAD): the first large-scale, multi-task benchmark for situational awareness in LLMs (7 task categories, 12,000+ questions)
Identified 'The Reversal Curse': LLMs trained on 'A is B' fail to infer 'B is A'
Created the TruthfulQA benchmark, showing that larger models are more likely to imitate common human falsehoods
Developed a lie detector for black-box LLMs using a fixed set of unrelated questions
Communicates research regularly via blog posts, Twitter, and LessWrong
Research Experience
Director of the Truthful AI research group in Berkeley
Previously worked on AI Alignment at the Future of Humanity Institute (FHI), University of Oxford
Former researcher at Ought, currently serves on its Board of Directors
Collaborates with colleagues such as James Chua on AI safety research
Offers the Astra Fellowship for 6-month research stays in Berkeley, with the potential to convert to a full-time role