About the job
The Special Projects team at Apple is developing novel experiences powered by state-of-the-art agentic vision-language models that incorporate visual context into conversational interaction. We are looking for a Machine Learning Engineer to help us build, fine-tune, and rigorously evaluate these systems. A successful candidate has hands-on experience with vision-language models, knows how to translate ambiguous product requirements into measurable evaluation criteria, and is excited to work at the intersection of multimodal modeling and agentic AI.
Responsibilities
Build and evaluate vision-language agents that perceive real-world scenes and incorporate that context into conversational models
Curate, annotate, and build multimodal datasets to support model training and evaluation
Develop automated evaluation pipelines including LLM-as-judge frameworks, human evaluation protocols, and domain-specific benchmarks
Fine-tune Large Language Models (LLMs) and Vision-Language Models (VLMs) to improve performance for specific use cases
Work closely with other ML Researchers to define evaluation criteria and methodology to systematically evaluate foundation models
Design controlled experiments to measure model capabilities, identify failure modes, and drive iterative model improvements
Conduct robust statistical analysis to identify model deficiencies, failure modes, and performance gaps
Qualifications
Minimum
Bachelor’s or Master’s degree in Computer Science or Machine Learning
2+ years of hands-on experience building and evaluating generative AI or multimodal models
Experience working with vision-language models or multimodal systems
Proficiency in Python and ML frameworks (PyTorch or TensorFlow)
Preferred
PhD in Computer Science, Machine Learning, Statistics, or other STEM field
Prior industry internship or research experience applying ML to product use cases
Experience with video understanding, temporal reasoning, or activity recognition
Familiarity with agentic system design including tool use, grounding, or perceive-act loops
Experience building or working with large-scale multimodal data and annotation pipelines
Proficiency in training, fine-tuning, and evaluation of foundation models and frameworks
Publications or technical presentations in Machine Learning journals or conferences
Excellent communication and cross-functional collaboration skills