Scholar
Joe Benton
Google Scholar ID: ywp_eYsAAAAJ
Anthropic
Machine Learning
Statistics
Citations & Impact
All-time
Citations: 1,092
H-index: 12
i10-index: 12
Publications: 16
Co-authors: 10 (list available)
Contact
No contact links provided.
Publications
10 items
Natural Emergent Misalignment from Reward Hacking in Production RL (2025). Cited: 0
Optimizing AI Agent Attacks With Synthetic Data (2025). Cited: 0
Evaluating Control Protocols for Untrusted AI Agents (2025). Cited: 0
Inverse Scaling in Test-Time Compute (2025). Cited: 0
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (2025). Cited: 0
Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning (2025). Cited: 0
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents (2025). Cited: 0
Reasoning Models Don't Always Say What They Think (2025). Cited: 7
Co-authors
10 total
Arnaud Doucet
Google DeepMind
George (Yorgos) Deligiannidis
Professor of Statistics, University of Oxford
Valentin De Bortoli
Google DeepMind, London
Buck Shlegeris
CEO, Redwood Research
Co-author 5
Tom Rainforth
Associate Professor, University of Oxford
Co-author 7
Adam Scherlis
Interpretability Researcher