Senior Software Engineer, Model Inference

About the job

Join Apple Maps to help build the best map in the world. In this role on ML Platform, you will help bring advanced deep learning and large language models into high-volume, low-latency, highly available production serving, improving search quality and powering experiences across Maps. You will partner closely with research and product teams, take end-to-end ownership, and deliver measurable results at global scale.

Responsibilities

Own the technical architecture of large-scale ML inference platforms, defining long-term design direction for serving deep learning and large language models across Apple Maps.

Lead system-level optimization efforts across the inference stack, balancing latency, throughput, accuracy, and cost through advanced techniques such as quantization, kernel fusion, speculative decoding, and efficient runtime scheduling.

Design and evolve control-plane services responsible for model lifecycle management, including deployment orchestration, versioning, traffic routing, rollout strategies, capacity planning, and failure handling in production environments.

Drive adoption of platform abstractions and standards that enable partner teams to onboard, deploy, and operate models reliably and efficiently at scale.

Partner closely with research, product, and infrastructure teams to translate model requirements into production-ready systems, providing technical guidance and feedback to influence upstream model design.

Optimize inference execution across heterogeneous compute environments, including GPUs and specialized accelerators, collaborating with runtime, compiler, and kernel teams to maximize hardware utilization.

Establish robust observability and performance diagnostics, defining metrics, dashboards, and profiling workflows to proactively identify bottlenecks and guide optimization decisions.

Provide technical leadership and mentorship, reviewing designs, setting engineering best practices, and raising the quality bar across teams contributing to the inference ecosystem.

Continuously evaluate emerging research and industry trends in LLM inference, distributed systems, and ML infrastructure, driving the transition of high-impact ideas into production systems.

Qualifications

Minimum

Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).

5+ years in software engineering focused on ML inference, GPU acceleration, and large-scale systems.

Expertise in deploying and optimizing LLMs for high-performance, production-scale inference.

Proficiency in Python, Java, or C++.

Experience with deep learning frameworks like PyTorch, TensorFlow, and Hugging Face Transformers.

Experience with model serving tools (e.g., NVIDIA Triton, TensorFlow Serving, vLLM).

Experience with optimization techniques like attention fusion, quantization, and speculative decoding.

Skilled in GPU optimization (e.g., CUDA, TensorRT-LLM, cuDNN) to accelerate inference tasks.

Skilled in cloud technologies such as Kubernetes, Ingress, and HAProxy for scalable deployment.

Preferred

Master's or PhD in Computer Science, Machine Learning, or a related field.

Understanding of MLOps practices, continuous integration, and deployment pipelines for machine learning models.

Familiarity with model distillation, low-rank approximations, and other model compression techniques for reducing memory footprint and improving inference speed.

Strong understanding of distributed systems, multi-GPU/multi-node parallelism, and system-level optimization for large-scale inference.