About the job
On the Frontier Systems team, you’ll build critical infrastructure that keeps our supercomputers running reliably for cutting-edge AI research. Even a single hardware failure can derail a large-scale training run, so minimizing disruptions is core to the mission. Engineers here own their work end-to-end and are trusted to make a real impact. This role is for someone who goes deep - who thrives on root-causing system-level issues and building automation to catch and fix problems at scale.
Responsibilities
- Own and improve the system health checks that keep our hyperscale supercomputers stable during model training.
- Lead deep dives into hardware failures and system-level bugs to understand how things break at scale.
- Build automation that monitors and fixes issues across thousands of machines - so researchers can keep moving without interruption.
Qualifications
Minimum
- 7+ years of industry experience in software engineering
- Proficiency with Python and shell scripting
- A high degree of comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool necessary
- Experience developing reproducible analyses
- A balance of strengths in building and operationalizing
Preferred
- Experience with low level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, Infiniband, networking, power management, kernel perf tuning)
- Experience with visualization of large data centers and networks.
- Expertise with network operations and tooling
- Expertise with power management and stabilization