Member of Technical Staff, Capacity & Efficiency Infrastructure

About the job

Microsoft AI is looking for a Member of Technical Staff – Capacity & Efficiency Infrastructure to help us manage our compute fleet and improve its efficiency. We’re seeking someone who brings an abundance of positive energy, empathy, and kindness to the team every day, in addition to being highly effective. The ideal candidate enjoys building world-class consumer experiences and products in a fast-paced environment. You will actively contribute to the development of AI models powering our innovative products. Expect to wear multiple hats and work across engineering, research, and everything in between. Your contributions will span model architecture, data curation, training and inference infrastructure, evaluation protocols, alignment and reinforcement learning from human feedback (RLHF), and many other exciting topics at the cutting edge of AI.

Responsibilities

Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters.

Build and evolve telemetry systems to provide visibility into infrastructure and ML model performance, utilization, and cost-related metrics.

Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.

Drive architectural changes across ML services to deliver measurable efficiency improvements.

Build and evolve tools that automatically provide insights and recommendations to improve fleet-wide efficiency.

Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies.

Partner with ML researchers and infrastructure engineers to understand their plans and future needs, and develop strategies that balance growth with efficiency.

Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, MAIA, and beyond).

Embody our Culture and Values.

Qualifications

Minimum

Bachelor’s Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Preferred

Bachelor’s Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C++ or Python OR Master’s Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C++ or Python OR equivalent experience.

Deep understanding of the fundamentals of GPU architectures and DL/LLM architectures.

Deep experience in profiling and analyzing performance in large-scale distributed computing systems.

Deep experience in profiling and analyzing performance in ML models, especially GenAI models.

Experience with low-level GPU programming (CUDA, Triton, NCCL) and frameworks such as PyTorch or JAX.

Experience in leading technical projects and supporting architectural decisions with data.

Experience building infrastructure for large-scale machine learning or generative AI workloads.

Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms.