Dylan Hadfield-Menell

Google Scholar ID: 4mVPFQ8AAAAJ

Massachusetts Institute of Technology

Artificial Intelligence

Citations & Impact

All-time

Citations: 5,039
H-index: 34
i10-index: 56
Publications: 20
Co-authors: 28

Publications

17 items

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases (2026). Cited: 0

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport (2026). Cited: 0

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains (2026). Cited: 0

The Prosocial Ranking Challenge: Reducing Polarization on Social Media without Sacrificing Engagement (2026). Cited: 0

Prompt Injection as Role Confusion (2026). Cited: 0

Surgical Activation Steering via Generative Causal Mediation (2026). Cited: 0

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia (2025). Cited: 2

Open-Universe Assistance Games (2025). Cited: 0

Resume (English only)

Background

Associate Professor of EECS at MIT
Head of the Algorithmic Alignment Group at CSAIL
Research focuses on AI alignment—ensuring AI systems behave in accordance with human and societal values
Works on alignment challenges in multi-agent systems, human-AI teams, and societal oversight of machine learning
Aims to enable safe, beneficial, and trustworthy real-world deployment of AI

Co-authors

28 total

Assistant Professor at UC Berkeley // Director, AI Safety and Alignment, Google DeepMind

PhD student, MIT

UC Berkeley | Covariant

Gillian K. Hadfield

Johns Hopkins University, Dept of Computer Science and School of Government and Policy

Stanford University

Thomas L. Griffiths

Professor of Psychology and Computer Science, Princeton University