Sr. Research Manager, Evaluation Science

About the job

AI systems are only as trustworthy as the methods used to evaluate them. At Apple, where AI powers experiences for billions of people, getting evaluation right is not a support function. It is a foundational science. As these systems grow in complexity, the quality of our products is increasingly constrained by the quality of our evaluation methods. Our team is building the scientific foundation and self-service tools for how AI evaluation is done at scale, spanning LLMs, agentic systems, and human-AI interaction. We don’t just publish methods; we productionize them. We are looking for a Sr. Research Manager to lead an ML research team that advances the state of the art in evaluation, developing methods that can be shipped as production tools for Apple developers and published in top venues.

Responsibilities

Set and execute the research strategy, defining a balanced portfolio and making clear investment decisions between near-term capability work and longer-term scientific bets.

Build and lead an ML research team advancing evaluation methods at the frontier. Recruit, develop, and retain top research talent through a compelling research vision, strong mentorship, and a culture that values both publication and the translation of novel methods into working systems.

Maintain a strong external research presence at top venues while delivering evaluation capabilities that are adopted internally. Design research projects scoped from inception for productionization.

Partner with platform engineering and applied science teams to translate research into self-service evaluation infrastructure. The goal is not just a working system but an abstraction that an Apple engineer without a research background can apply correctly. Collaborate on the design of evaluation SDKs and APIs with that end user in mind from the start.

Identify the evaluation problems most worth solving across Apple and ensure your team's work is designed to address them. The goal is not just research outputs but capabilities that other teams can adopt without needing your team to operate them.

Create an environment where interdisciplinary researchers can collaborate productively without flattening the distinct expertise each discipline brings.

Serve as a visible leader in evaluation science, representing the team at conferences, workshops, and in the broader research community.

Qualifications

Minimum

Ph.D. in Computer Science, Machine Learning, Statistics, or a closely related field

5+ years of experience managing or leading research teams in an industry setting, with demonstrated ability to attract and retain strong research talent

Experience publishing research at top-tier AI/ML venues (NeurIPS, ICML, ICLR, ACL, EMNLP)

Experience partnering with applied science and engineering teams to translate research into production systems, tools, or capabilities adopted by others

Technical depth in AI evaluation, with the ability to critically assess and advance methods for measuring AI system behavior, whether through automated judgment, benchmark design, synthetic data, human evaluation, or other approaches

Demonstrated ability to set research strategy, manage a research portfolio with competing priorities, and make disciplined investment decisions across near-term and long-term work

Excellent communication skills, including the ability to represent research to executive leadership, partner teams, and the external research community

Preferred

Ability to bridge ML research and measurement science. This could mean a machine learning background with genuine familiarity with validity and evaluation design, or a measurement science background with strong technical depth in ML methods

Publications or demonstrated expertise specifically in evaluation methodology (papers about how to evaluate, not just papers that use evaluation)

Demonstrated ability to coach researchers toward higher-impact publications: improving framing, identifying contribution clarity issues, and helping position work for acceptance at top-tier venues

Strong opinions about how evaluation methods should be implemented in user-facing tools: what defaults, abstractions, and guardrails make the difference between a generic SDK and a world-class evaluation platform

Experience designing research with self-service adoption as a first-class constraint, where the end goal is not a bespoke system your team operates but a method or tool that others can apply correctly without deep knowledge of the underlying research

Track record of personally recruiting research talent in competitive hiring markets, including sourcing candidates who would not have applied through standard channels