Henry Sleight
Google Scholar ID: FRHn0z4AAAAJ
Research Manager, Anthropic Fellows Program; Program Manager, Constellation
AI Safety
Adversarial Robustness
Model Organisms of Misalignment
Homepage
Google Scholar
Citations & Impact (all-time)
Citations: 395
H-index: 9
i10-index: 9
Publications: 19
Co-authors: 3
Publications
Abstractive Red-Teaming of Language Model Character (2026), cited 1
The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity? (2026), cited 0
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs (2025), cited 0
Evaluating Control Protocols for Untrusted AI Agents (2025), cited 0
Believe It or Not: How Deeply do LLMs Believe Implanted Facts? (2025), cited 0
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language (2025), cited 0
Stress-Testing Model Specs Reveals Character Differences among Language Models (2025), cited 0
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (2025), cited 0
Resume (English only)
Co-authors (3 total)
Ethan Perez (Anthropic)
John Hughes (Anthropic)
Rylan Schaeffer (Stanford University)