About the job
Do you get excited by driving product impact through measurement and evaluation, for products and services used by hundreds of millions of people globally? The vision for the AIML Evaluation organization is to improve products by using data as the voice of our customers. Within this organization, the mission of the Data Science and Insight team is to inform product evolution through measurement, evaluation, and analysis of the user experience. You will partner with Apple Intelligence engineering teams to improve product quality and guide feature development with data, delivering amazing experiences across iPhone, iPad, HomePod, Mac, Apple Watch, and Apple TV, in dozens of languages.
Responsibilities
Design and Own End-to-End Evaluation Frameworks: Develop rigorous evaluation methodologies for AI/ML systems, including metric definition, sampling strategy, experiment design, and statistical validity checks. Build scalable pipelines that ensure trustworthy, reproducible, and interpretable results across product surfaces and model iterations.
Build High-Quality Evaluation Datasets & Human-in-the-Loop Systems: Create and maintain gold-standard datasets for offline and online model assessment. Lead data generation and annotation workflows (e.g., human ratings, red teaming, preference data, domain-specific evals), ensuring coverage, data quality, bias mitigation, and alignment with product and safety goals.
Partner Cross-Functionally to Drive Model & Product Decision-Making: Translate evaluation insights into actionable recommendations for model training, ranking, and product launches. Collaborate closely with Research, Engineering, Product, and Safety teams to define quality bars, monitor regressions, optimize user experience, and guide roadmap prioritization.
Qualifications
Minimum
Experience in data science, machine learning, and analytics, including statistical data analysis and A/B testing.
Experience articulating business questions and translating them into statistical analyses that answer them with available data.
Strong programming skills, including data-querying skills (SQL and/or Spark, etc.) and experience with a scripting language for data processing and development (e.g., Python, R, or Scala).
Excellent collaboration skills to achieve impactful results by working effectively with diverse cross-functional teams, including PMs, engineers, data scientists, and others.
B.S. in machine learning, computer science, statistics, operations research, or another quantitative field.
Preferred
Good understanding of large language models (LLMs), including their architecture, training methods, prompt engineering, and fine-tuning for specific tasks.
Hands-on experience applying LLMs to solve technical problems, such as data analysis, data automation, and synthetic data generation, with a proven ability to optimize model performance for accuracy and efficiency.
Ph.D. in machine learning, computer science, statistics, operations research, or another quantitative field.
5 years of relevant work experience.