Machine Learning Engineer - Visual Agents - Special Projects

Apple
Cupertino, United States of America
2026-04-17

About the job

The Special Projects team at Apple is developing novel experiences powered by state-of-the-art agentic vision-language models that incorporate visual context into conversational interaction. We are looking for a Machine Learning Engineer to help us build, fine-tune, and rigorously evaluate these systems. A successful candidate has hands-on experience with vision-language models, knows how to translate ambiguous product requirements into measurable evaluation criteria, and is excited to work at the intersection of multimodal modeling and agentic AI.

Responsibilities

Build and evaluate vision-language agents that perceive real-world scenes and incorporate that context into conversational models

Curate, annotate, and build multimodal datasets to support model training and evaluation

Develop automated evaluation pipelines including LLM-as-judge frameworks, human evaluation protocols, and domain-specific benchmarks

Fine-tune Large Language Models (LLMs) and Vision-Language Models (VLMs) to improve performance for specific use cases

Work closely with other ML Researchers to define evaluation criteria and methodology to systematically evaluate foundation models

Design controlled experiments to measure model capabilities, identify failure modes, and drive iterative model improvements

Conduct robust statistical analysis to identify model deficiencies, failure modes, and performance gaps

Qualifications

Minimum

BA or Master’s degree in Computer Science or Machine Learning

2+ years of hands-on experience building and evaluating generative AI or multimodal models

Experience working with vision-language models or multimodal systems

Proficiency in Python and ML frameworks (PyTorch or TensorFlow)

Preferred

PhD in Computer Science, Machine Learning, Statistics, or other STEM field

Prior industry internship or research experience applying ML to product use cases

Experience with video understanding, temporal reasoning, or activity recognition

Familiarity with agentic system design including tool use, grounding, or perceive-act loops

Experience building or working with large-scale multimodal data and annotation pipelines

Proficiency in training, fine-tuning, and evaluating foundation models and their associated frameworks

Publications or technical presentations in Machine Learning journals or conferences

Excellent communication and cross-functional collaboration skills