Research Intern - AI Evaluation and Alignment

About the job

Research Internships at Microsoft provide a dynamic environment for research careers with a network of world-class research labs led by globally recognized scientists and engineers, who pursue innovation in a range of scientific and technical disciplines to help solve complex challenges in diverse fields, including computing, healthcare, economics, and the environment. Microsoft Research and the Copilot Studio team are seeking Research Interns to help advance the quality, reliability, and evaluation of Large Language Model (LLM)-based systems. Research Interns will collaborate with applied scientists and engineers to explore new machine learning methods that improve how Artificial Intelligence (AI) systems assess and align with human expectations.

Responsibilities

Research Interns put inquiry and theory into practice. Alongside fellow doctoral candidates and some of the world’s best researchers, Research Interns learn, collaborate, and network for life. Research Interns not only advance their own careers, but they also contribute to exciting research and development strides. During the 12-week internship, Research Interns are paired with mentors and expected to collaborate with other Research Interns and researchers, present findings, and contribute to the vibrant life of the community.

Co-developing a research project in collaboration with the supervisor and research mentors.

Designing and implementing machine learning approaches, including training and fine-tuning using real-world datasets.

Developing evaluation frameworks and benchmarking methods to assess model quality, robustness, and generalization.

Presenting and communicating research findings.

Qualifications

Minimum

Currently enrolled in a PhD program in Statistics, Computer Science, Physics, Operations Research, or a related technical field.

At least 1 year of hands-on experience working on LLM-related projects (e.g., prompt engineering, building and evaluating LLM-based systems, reward modeling).

At least 1 year of experience coding in Python.

Preferred

Prior experience with reward models for large language models or LLM-as-a-Judge.

Strong experience with deep learning frameworks (e.g., PyTorch, TensorFlow) and familiarity with software engineering best practices (e.g., Git).

Experience with LLM post-training and evaluation or LLM-based judge systems.

Research experience demonstrated through publications or projects.

Ability to work independently in ambiguous or rapidly evolving situations and collaborate effectively across disciplines.