Research Experience
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Conditioning Predictive Models
An overview of 11 proposals for building safe advanced AI
Risks from Learned Optimization
Previously: MIRI, OpenAI

Background
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Miscellany
Personal interests: engaging in discussions and encouraging people to apply to the Anthropic Fellows program, a safety-focused mentorship program.