Publications
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
A StrongREJECT for Empty Jailbreaks
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
ALMANACS: A Simulatability Benchmark for Language Model Explainability
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Research Experience
Interested in both the theory and practice of AI alignment. Helped characterize how RLHF can lead to deception when the AI observes more of the environment than the human evaluator, develop multimodal attacks and benchmarks for open-ended agents, and use mechanistic interpretability to find evidence of learned look-ahead in a chess-playing neural network.
Education
PhD, University of California, Berkeley, Center for Human-Compatible AI, Advisor: Stuart Russell.
Background
A research scientist at Google DeepMind focused on AI safety and alignment. Completed his PhD at UC Berkeley’s Center for Human-Compatible AI, advised by Stuart Russell. Previously co-founded far.ai, a 501(c)(3) research nonprofit that incubates and accelerates beneficial AI research agendas.