Distributed Training Engineer, Sora

OpenAI
San Francisco, CA, USA
2024-03-15

About the job

As a Distributed Systems/ML engineer, you will improve the training throughput of our internal training framework and enable researchers to experiment with new ideas. This requires good engineering (for example, designing, implementing, and optimizing state-of-the-art AI models), writing bug-free machine learning code (surprisingly difficult!), and acquiring deep knowledge of the performance of supercomputers. We’re looking for people who love optimizing performance and understanding distributed systems, and who cannot stand having bugs in their code.

Responsibilities

Collaborate with researchers to enable them to develop systems-efficient video models and architectures

Apply the latest techniques to our internal training framework to achieve impressive hardware efficiency for our training runs

Profile and optimize our training framework

Qualifications

Minimum

Have experience working with multi-modal ML pipelines

Love diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability

Have strong software engineering skills and are proficient in Python

Have experience understanding and optimizing training kernels

Are passionate about understanding stable training dynamics
