Researcher, Automated Red Teaming

About the job

This role leads the Automated Red Teaming (ART) effort: building scalable, research-driven systems that continuously uncover failure modes in our models and safeguards, and translate those findings into actionable, production-facing improvements. The goal is to reduce expected harm by finding the highest-leverage, least-covered weaknesses early and reliably.

Responsibilities

- Own the research and technical direction for automated red teaming across catastrophic risk areas, with an initial emphasis on:

  - Automated classifier jailbreak discovery (cyber and bio).

  - Automated bio threat-development elicitation (worst-feasible planning uplift).

  - CoT monitoring evasion probing (and adjacent loss-of-control evaluations).

- Partner closely with:

  - Vertical risk teams (Cyber, Bio, Loss of Control) to define threat models, prioritize targets, and land mitigations.

  - The Classifiers team to turn discovered attacks into training data, evals, and measurable robustness gains.

  - Product / Engineering / Safety stakeholders to ensure ART outputs are operationally useful.

Qualifications

Minimum

- Feel a strong pull toward AI safety and are motivated by reducing real-world catastrophic risk (not just publishing cool results).

- Love breaking systems (responsibly) — you get energy from finding weird, high-severity failure modes and turning them into concrete fixes.

- Have strong applied research instincts, especially around evaluations: you’re good at designing experiments that are reproducible, interpretable, and hard to fool.

- Bring hands-on experience with LLMs and agents, including multi-turn behaviors, tool use, and the ways models adapt to constraints.

- Are comfortable building scalable automation, not just prototypes — you can turn red-teaming ideas into pipelines that run continuously and produce high-signal outputs.

- Have solid software engineering fundamentals (data structures, algorithms, testing discipline) and can work effectively in a production-adjacent environment.

- Think in threat models and incentives, and naturally ask “what would an attacker do next?” or “how would this fail under pressure?”

- Can translate messy findings into action, communicating clearly with researchers, engineers, product, and policy — and driving alignment on what to fix first.

- Care about efficiency and prioritization, and are happy to say “no” to low-leverage work to focus on what moves the risk needle.

Preferred

- Experience in adversarial ML, security research / red teaming, abuse prevention systems, or large-scale eval infrastructure.