Scott Emmons
Google Scholar ID: LoT0z6oAAAAJ
Google DeepMind
AI Alignment · Adversarial Robustness · Interpretability · Cooperative AI
Citations & Impact (all-time)
  • Citations: 1,497
  • H-index: 15
  • i10-index: 18
  • Publications: 20
  • Co-authors: 13
Academic Achievements
  • A Pragmatic Way to Measure Chain-of-Thought Monitorability
  • Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
  • When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
  • An Approach to Technical AGI Safety and Security
  • Observation Interference in Partially Observable Assistance Games
  • Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
  • The Partially Observable Off-Switch Game
  • Obfuscated Activations Bypass LLM Latent-Space Defenses
  • When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
  • A StrongREJECT for Empty Jailbreaks
  • Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
  • Image Hijacks: Adversarial Images can Control Generative Models at Runtime
  • ALMANACS: A Simulatability Benchmark for Language Model Explainability
  • Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Research Experience
  • Interested in both the theory and practice of AI alignment. Helped characterize how RLHF can lead to deception when the AI sees more than the human, develop multimodal attacks and benchmarks for open-ended agents, and use mechanistic interpretability to find evidence of learned look-ahead in a chess-playing neural network.
Education
  • PhD, University of California, Berkeley, Center for Human-Compatible AI, Advisor: Stuart Russell.
Background
  • A research scientist at Google DeepMind focused on AI safety and alignment. Completed his PhD at UC Berkeley’s Center for Human-Compatible AI, advised by Stuart Russell. Previously co-founded far.ai, a 501(c)(3) research nonprofit that incubates and accelerates beneficial AI research agendas.