Senior Researcher - AI and Systems Reliability - Microsoft Research

Microsoft
Redmond, WA, USA / San Francisco Bay Area, USA / New York City metropolitan area, USA | 2025-12-16 | Onsite

About the job

Help shape the future of reliable AI systems. At Microsoft Research’s AI and Systems Reliability Group (Redmond, WA), we push the boundaries of foundational research and turn ideas into impact across Microsoft and beyond. Our mission is to tackle ambitious challenges that redefine the computing landscape. We are seeking Senior Researchers in areas such as distributed systems and reliability, formal methods and verification, machine learning for system reliability, and reliability of machine learning systems. As artificial intelligence (AI) technologies such as large language models become central to everyday computing, we are looking for experts who can bring formal rigor and reliability guarantees to AI-powered personal, mobile, and datacenter platforms. If you thrive in collaborative environments and are passionate about solving some of the world’s most important problems, we want to hear from you.

Responsibilities

As a Senior Researcher in AI and Systems Reliability at Microsoft Research, you will define a novel research agenda and drive an effective program of fundamental and applied research. We highly value collaboration and building new ideas with members of the group and with others. You will have the direct opportunity to realize your ideas in products and services used worldwide.

Qualifications

Minimum

PhD (completed or currently pursuing) in Computer Science or Computer Science Engineering

Preferred

A research program demonstrated by journal and conference publications (e.g., NeurIPS, SOSP, OSDI)

Firm understanding of distributed systems and cloud systems

Demonstrable ability to work in a multidisciplinary team

Effective communication skills and the ability to work in a collaborative environment

A PhD focused on any one of the following core research areas: datacenter networking, distributed systems, formal methods and verification, high-performance computing, ML systems, operating systems, programming languages, storage systems, systems reliability, systems security and software engineering