About the job
We are seeking an experienced engineer to work on distributed AI/ML systems. This role involves working on collective operations, the fundamental operations that enable AI to scale across multiple accelerators and servers. Most of our stack is C/C++ and relatively low level, so solid knowledge of Linux, kernels, and writing performant code is important. Experience with embedded systems is valued, and experience with high-speed networking or HPC interconnects is valued highly.
Responsibilities
Work on distributed AI/ML systems
Develop collective operations enabling AI to scale across multiple accelerators and servers
Build networking solutions for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS
Collaborate with infrastructure experts, hardware engineers, RTL engineers, scientists, and architects
Qualifications
Minimum
3+ years of non-internship professional software development experience
2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
Experience programming in at least one programming language
Knowledge of Linux fundamentals
Preferred
Experience with embedded systems
Experience with high-speed networking or HPC interconnects (valued highly)