About the job
This role leads the Automated Red Teaming (ART) effort: building scalable, research-driven systems that continuously uncover failure modes in our models and safeguards, and translate those findings into actionable, production-facing improvements. The goal is to reduce expected harm by finding the highest-leverage, least-covered weaknesses early and reliably.
Responsibilities
- Own the research and technical direction for automated red teaming across catastrophic risk areas, with an initial emphasis on:
  - Automated classifier jailbreak discovery (cyber and bio).
  - Automated bio threat-development elicitation (worst-feasible planning uplift).
  - Chain-of-thought (CoT) monitoring evasion probing (and adjacent loss-of-control evaluations).
- Partner closely with:
  - Vertical risk teams (Cyber, Bio, Loss of Control) to define threat models, prioritize targets, and land mitigations.
  - The Classifiers team to turn discovered attacks into training data, evals, and measurable robustness gains.
  - Product / Engineering / Safety stakeholders to ensure ART outputs are operationally useful.
Qualifications
Minimum
- Feel a strong pull toward AI safety and are motivated by reducing real-world catastrophic risk (not just publishing cool results).
- Love breaking systems (responsibly): you get energy from finding weird, high-severity failure modes and turning them into concrete fixes.
- Have strong applied research instincts, especially around evaluations: you’re good at designing experiments that are reproducible, interpretable, and hard to fool.
- Bring hands-on experience with LLMs and agents, including multi-turn behaviors, tool use, and the ways models adapt to constraints.
- Are comfortable building scalable automation, not just prototypes: you can turn red-teaming ideas into pipelines that run continuously and produce high-signal outputs.
- Have solid software engineering fundamentals (data structures, algorithms, testing discipline) and can work effectively in a production-adjacent environment.
- Think in threat models and incentives, and naturally ask “what would an attacker do next?” or “how would this fail under pressure?”
- Can translate messy findings into action, communicating clearly with researchers, engineers, product, and policy, and driving alignment on what to fix first.
- Care about efficiency and prioritization, and are happy to say “no” to low-leverage work to focus on what moves the risk needle.
Preferred
- Experience in adversarial ML, security research / red teaming, abuse prevention systems, or large-scale eval infrastructure.