About the job
Microsoft AI is looking for a Member of Technical Staff – Capacity & Efficiency Infrastructure to help us manage, and improve the efficiency of, our compute fleet. We’re seeking someone who brings an abundance of positive energy, empathy, and kindness to the team every day, in addition to being highly effective. The ideal candidate enjoys building world-class consumer experiences and products in a fast-paced environment. You will actively contribute to the development of AI models powering our innovative products. Expect to wear multiple hats and work across engineering, research, and everything in between. Your contributions will span model architecture, data curation, training and inference infrastructure, evaluation protocols, alignment and reinforcement learning from human feedback (RLHF), and many other exciting topics at the cutting edge of AI.
Responsibilities
Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters.
Build and evolve telemetry systems to provide visibility into infrastructure and ML model performance, utilization, and cost-related metrics.
Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
Drive architectural improvements across ML services that deliver measurable efficiency gains.
Build and evolve tools to automatically provide insights and recommendations to improve fleet-wide efficiency.
Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies.
Partner with ML researchers and infrastructure engineers to understand their roadmaps and future needs, and develop plans that balance growth with efficiency.
Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, MAIA, and beyond).
Embody our Culture and Values.
Qualifications
Minimum
Bachelor’s Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Preferred
Bachelor’s Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C++ or Python OR Master’s Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C++ or Python OR equivalent experience
Deep understanding of the fundamentals of GPU architectures and DL/LLM architectures.
Deep experience in profiling and analyzing performance in large-scale distributed computing systems.
Deep experience in profiling and analyzing performance in ML models, especially GenAI models.
Experience with low-level GPU programming (CUDA, Triton, NCCL) and frameworks such as PyTorch or JAX.
Experience in leading technical projects and supporting architectural decisions with data.
Experience building infrastructure for large-scale machine learning or generative AI workloads.
Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms.
Track record of contributing to high-performance computing or large-scale AI infrastructure projects.