Research Scientist, AI Evaluation Science

Apple
Seattle, United States of America
2026-03-03

About the job

AI systems are only as trustworthy as the methods used to evaluate them. At Apple, where AI powers experiences for billions of people, getting evaluation right is not a support function—it is a foundational science. Our team, part of Apple Services Engineering, is building that scientific foundation: rigorous, scalable evaluation methodology for LLMs, agentic systems, and human-AI interaction.

Responsibilities

Advance evaluation methodology through original research in one or more of the following areas: preference learning and reward modeling (RLHF, DPO, reward hacking mitigation); LLM-as-judge calibration, rubric design, and bias detection; intelligent evaluation strategies including active learning for test selection and automated failure discovery; or validity frameworks for evaluators (construct validity, transfer learning). You are not expected to cover all of these—depth matters more than breadth.

Publish at top-tier venues (NeurIPS, ICML, ICLR, ACL, EMNLP), contributing to evaluation science as a recognized research area and representing Apple in the research community.

Translate research into production-ready tools by partnering with platform engineers to build your methods into evaluation SDKs and APIs used across Apple.

Collaborate with measurement scientists to integrate psychometric methods and validity frameworks into evaluation systems, ensuring evaluators measure what they claim to measure.

Define the team's research agenda for evaluation science by identifying high-leverage open problems, validating that they address real-world challenges faced by ML engineers across Apple, and designing rigorous experimental programs to solve them.

Qualifications

Minimum

Ph.D. in Computer Science, Machine Learning, or a closely related field, with a research focus in evaluation-adjacent areas (preference learning, RLHF, human feedback, calibration, automated assessment)

Strong publication record at top-tier conferences (NeurIPS, ICML, ICLR, ACL, EMNLP), including first-author publications demonstrating independent research contributions

Deep technical expertise in at least one evaluation-adjacent ML area, with strong mathematical foundations: preference learning and reward modeling (RLHF, DPO, reward hacking, specification gaming); OR calibration theory, proper scoring rules, and statistical reliability; OR human-AI interaction methodology (active learning, annotation quality, preference elicitation)

Demonstrated ability to implement complex methods from recent papers and run large-scale experiments

Track record of translating research into practical systems—prototypes, tools, or methods adopted by others

Excellent written and verbal communication skills, including the ability to write clear research papers and explain complex concepts to diverse audiences

Preferred

Publications specifically on evaluation methodology—papers about how to evaluate, not just papers that use evaluation to demonstrate model improvements

Strong hands-on experience with modern ML frameworks (PyTorch, JAX, or TensorFlow) and training or fine-tuning large language models

Experience with theoretical foundations of evaluation: measurement theory and validity frameworks, statistical learning theory (calibration, reliability, decision theory), or preference elicitation and aggregation

Specific research experience in one or more of: reward modeling and RLHF for alignment; LLM-as-judge approaches (calibration, rubric design, bias mitigation); benchmark design and validation (IRT, contamination detection); human evaluation methodology (protocol design, quality control); or agentic and multi-agent system evaluation

Demonstrated passion for evaluation as a research area: conference presentations, workshops, or tutorials on evaluation topics; open-source contributions to evaluation tools or benchmarks; active engagement with the evaluation research community

Experience with cross-disciplinary research, such as collaboration with social scientists, psychometricians, or domain experts