About the job
As a Distributed Systems/ML engineer, you will work on improving the training throughput of our internal training framework and enabling researchers to experiment with new ideas. This requires good engineering (for example, designing, implementing, and optimizing state-of-the-art AI models), writing bug-free machine learning code (surprisingly difficult!), and acquiring deep knowledge of the performance of supercomputers. We’re looking for people who love optimizing performance and understanding distributed systems, and who cannot stand having bugs in their code.
Responsibilities
Collaborate with researchers to enable them to develop systems-efficient video models and architectures
Apply the latest techniques to our internal training framework to achieve impressive hardware efficiency for our training runs
Profile and optimize our training framework
Qualifications
Minimum
Have experience working with multi-modal ML pipelines
Love diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability
Have strong software engineering skills and are proficient in Python
Have experience understanding and optimizing training kernels
Are passionate about understanding stable training dynamics