About the job
Imagine what you could do here. At Apple, great ideas have a way of becoming phenomenal products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish.
Do you love solving complex distributed systems challenges at massive scale? Are you passionate about Kubernetes scheduling, resource management, and building platforms that power the next generation of Machine Learning and Data workloads? Do you thrive in designing and operating highly reliable, large-scale job scheduling and orchestration systems that serve as the backbone of AI and Data infrastructure? If so, join the Apple Data Platform team to design and build a scalable batch and ML infrastructure platform used across Apple.
As part of Apple Data Platform, you will play a meaningful role in designing, developing, and deploying high-performance systems that power batch and ML workloads across Apple's global infrastructure spanning public clouds and Apple data centers. This enormous scale brings unique and complex challenges in resource scheduling, workload orchestration, and operational excellence that require extraordinarily creative problem-solving.
Responsibilities
Design, build, and deploy highly reliable, large-scale distributed systems for batch processing and ML infrastructure across public clouds and Apple data centers using Go, Java, or Python
Architect and operate Kubernetes-native scheduling systems such as Kueue and YuniKorn, building custom operators and CRDs to manage complex ML and data workloads
Implement advanced scheduling strategies including gang scheduling, topology-aware routing, bin-packing, and fair-share queuing to maximize GPU efficiency and hardware utilization
Build and manage secure, multi-tenant Kubernetes environments with strict resource isolation, quota governance, and priority-based preemption
Drive end-to-end observability, monitoring, and incident response practices to ensure high availability and fault tolerance of production systems
Collaborate with ML researchers, data engineers, SRE, and product teams to integrate scheduling solutions into Apple's broader AI and data platform ecosystem
Contribute to platform adoption by guiding internal customers, gathering requirements, and delivering impactful platform capabilities
Qualifications
Minimum
5+ years of experience designing, developing, and operating highly available, large-scale distributed systems and data or ML infrastructure
Strong software engineering skills with deep programming expertise in Go, Java, or Python
Advanced knowledge of Kubernetes internals including custom controllers, scheduler architecture, resource quotas, and workload lifecycle management
Hands-on experience with Kubernetes-native batch scheduling frameworks such as Kueue or YuniKorn and advanced scheduling concepts like gang scheduling, bin-packing, and priority preemption
Experience with cloud-native infrastructure across multi-cloud and on-premises environments, including AWS and GCP
Strong commitment to operational excellence, system observability, and continuous improvement for mission-critical services
Preferred
Experience with GPU scheduling, accelerator-aware placement, and optimization for large-scale AI/ML workloads
Experience with distributed data and ML frameworks such as Apache Spark, Ray, PyTorch, JAX, or Flink at scale
Experience contributing to open-source projects in Kubernetes scheduling, container technologies, or ML infrastructure ecosystems such as Apache YuniKorn, Kueue, or similar systems
Experience using GenAI technologies to improve developer productivity, streamline engineering processes, and accelerate team execution