About the job
As an ML Systems Engineer on our Reinforcement Learning Engineering team, you'll be responsible for the critical algorithms and infrastructure that our researchers depend on to train models. Your work will directly enable breakthroughs in AI capabilities and safety. You'll focus obsessively on improving the performance, robustness, and usability of these systems so our research can progress as quickly as possible.
Responsibilities
Build, maintain, and improve the algorithms and systems used by finetuning researchers to train models; improve the speed, reliability, and ease of use of finetuning systems; profile the reinforcement learning pipeline to find opportunities for improvement; build systems that regularly launch training jobs in test environments; adapt finetuning systems for new model architectures; build instrumentation to detect and eliminate Python GIL contention; diagnose and fix training slowdowns; and implement stable, fast versions of new training algorithms proposed by researchers.
Qualifications
Minimum
You have 4+ years of software engineering experience; like working on systems and tools that make other people more productive; are results-oriented, with a bias towards flexibility and impact; pick up slack, even if it goes outside your job description; enjoy pair programming; want to learn more about machine learning research; and care about the societal impacts of your work.
Preferred
Experience with high-performance, large-scale distributed systems; large-scale LLM training; Python; and implementing LLM finetuning algorithms, such as RLHF.