Joe Benton

Google Scholar ID: ywp_eYsAAAAJ
Anthropic
Machine Learning, Statistics
Citations & Impact (all-time)
Citations: 1,092
H-index: 12
i10-index: 12
Publications: 16
Co-authors: 10
Contact
No contact links provided.
Publications
10 items
Natural Emergent Misalignment from Reward Hacking in Production RL (2025). Cited: 0
Optimizing AI Agent Attacks With Synthetic Data (2025). Cited: 0
Evaluating Control Protocols for Untrusted AI Agents (2025). Cited: 0
Inverse Scaling in Test-Time Compute (2025). Cited: 0
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (2025). Cited: 0
Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning (2025). Cited: 0
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents (2025). Cited: 0
Reasoning Models Don't Always Say What They Think (2025). Cited: 7
Co-authors
10 total
Arnaud Doucet
Google DeepMind
George (Yorgos) Deligiannidis
Professor of Statistics, University of Oxford
Valentin De Bortoli
Google DeepMind, London
Buck Shlegeris
CEO, Redwood Research
Co-author 5
Tom Rainforth
Associate Professor, University of Oxford
Co-author 7
Adam Scherlis
Interpretability Researcher
