About the job
Lead applied ML engineering on Scale's Applied ML team, which powers data infrastructure for leading agentic LLMs (ChatGPT, Gemini, Llama). You will build scalable multi-agent systems that validate agentic reasoning and behavior and scale human expertise. You will also drive research into why agents still fail in real-world use despite strong benchmark results, and ship the resulting fixes to production.
Responsibilities
Build and deploy multi-agent systems for agentic reasoning validation
Develop pipelines to detect errors and scale human judgment
Combine classical ML, LLMs, and multi-agent techniques for reliability
Lead research into agent failure modes and ship fixes
Use AI tools to speed prototyping and iteration
Build data-driven evaluations and deploy improvements rapidly
Integrate systems into Scale's platform
Qualifications
Minimum
PhD or MSc in Computer Science, Mathematics, Statistics, or related field
3+ years shipping production ML systems at scale
Demonstrated real-world impact
Mastery of PyTorch, TensorFlow, JAX, or scikit-learn
Deep expertise in agentic LLMs and multi-agent systems
Strong software engineering skills, including microservices on AWS or GCP
Track record of rapid, data-driven iteration
Proficiency using AI tools to accelerate work
Strong research depth with a bias toward practical application
Excellent cross-functional communication
Preferred
Experience prototyping agent evaluation/reliability systems
Experience with human-in-the-loop or annotation pipelines
Open-source contributions in agents, evaluation, or alignment
Publications on agent reliability (NeurIPS, ICML, ICLR)