Scholar
Jan Leike
Google Scholar ID: beiWcokAAAAJ
Anthropic
reinforcement learning
deep learning
agent alignment
Citations & Impact
All-time
Citations: 57,951
H-index: 28
i10-index: 37
Publications: 20
Co-authors: 20 (list available)
Publications
9 items
Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks
arXiv.org · 2026 · Cited by 2
Excess Description Length of Learning Generalizable Predictors
arXiv.org · 2026 · Cited by 0
Natural Emergent Misalignment from Reward Hacking in Production RL
2025 · Cited by 0
Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games
2025 · Cited by 0
Unsupervised Elicitation of Language Models
2025 · Cited by 0
Reasoning Models Don't Always Say What They Think
2025 · Cited by 7
Auditing language models for hidden objectives
2025 · Cited by 0
Forecasting Rare Language Model Behaviors
2025 · Cited by 0
Co-authors
20 total
Jeffrey Wu
Anthropic AI, OpenAI
John Schulman
Thinking Machines
Marcus Hutter
Researcher@DeepMind & Professor at ANU
David Scott Krueger
Assistant Professor, University of Montreal, Mila