Software Engineer, Frontier Systems

About the job

On the Frontier Systems team, you’ll build critical infrastructure that keeps our supercomputers running reliably for cutting-edge AI research. Even a single hardware failure can derail a large-scale training run, so minimizing disruptions is core to the mission. Engineers here own their work end-to-end and are trusted to make a real impact. This role is for someone who goes deep: someone who thrives on root-causing system-level issues and building automation to catch and fix problems at scale.

Responsibilities

- Own and improve the system health checks that keep our hyperscale supercomputers stable during model training.

- Lead deep dives into hardware failures and system-level bugs to understand how things break at scale.

- Build automation that monitors and fixes issues across thousands of machines, so researchers can keep moving without interruption.

Qualifications

Minimum

- 7+ years of industry experience in software engineering

- Proficiency with Python and shell scripting

- A high degree of comfort digging into noisy data with SQL, PromQL, pandas, or any other tool necessary

- Experience developing reproducible analyses

- A balance of strengths in building and operationalizing

Preferred

- Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel performance tuning)

- Experience with visualization of large data centers and networks

- Expertise with network operations and tooling

- Expertise with power management and stabilization