Samuel Marks
Anthropic
Google Scholar ID: fW7yK10AAAAJ
Research interests: large language models, AI safety, oversight, interpretability
Homepage
Google Scholar
Citations & Impact (all-time)
Citations: 2,043
h-index: 13
i10-index: 14
Publications: 20
Co-authors: 0
Contact
No contact links provided.
Publications
19 items (8 shown)
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation (2026). Cited: 0
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors (2026). Cited: 0
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (2025). Cited: 0
Auditing Games for Sandbagging (2025). Cited: 0
Unsupervised decoding of encoded reasoning using language model interpretability (2025). Cited: 0
Liars' Bench: Evaluating Lie Detectors for Language Models (2025). Cited: 0
Steering Evaluation-Aware Language Models To Act Like They Are Deployed (2025). Cited: 0
Believe It or Not: How Deeply do LLMs Believe Implanted Facts? (2025). Cited: 0
Resume (English only)
Co-authors
0 (list not available)