Software Development Engineer II, AI/ML Elastic Collectives

About the job

We are seeking an experienced engineer to work on distributed AI/ML systems. This role focuses on collective operations, the fundamental operations that enable AI to scale across multiple accelerators and servers. Most of our stack is C/C++ and relatively low level, so solid knowledge of Linux, kernels, and performant code is important. Experience with embedded systems is valued, and experience with high-speed networking or HPC interconnects is valued highly.

Responsibilities

Work on distributed AI/ML systems

Develop collective operations enabling AI to scale across multiple accelerators and servers

Build networking solutions for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS

Collaborate with infrastructure experts, hardware engineers, RTL engineers, scientists, and architects

Qualifications

Minimum

3+ years of non-internship professional software development experience

2+ years of non-internship design or architecture (design patterns, reliability and scaling) of new and existing systems experience

Experience programming with at least one software programming language

Knowledge of Linux fundamentals

Preferred