About the job
The Google Kubernetes Engine (GKE) AI Platform team is responsible for managing containerized workloads and services on cutting-edge AI infrastructure (GPUs/TPUs) using Kubernetes (K8s), an open-source platform. As GKE experiences exponential growth in the AI, ML, and GenAI sectors, our team ensures that this infrastructure is automated, frictionless, and capable of handling the next generation of cloud computing.
Responsibilities
Act as an AI Platform Technical Lead (TL), driving innovation in the reliability, efficiency, and scale of GKE AI/ML infrastructure.
Engage with Megawhale customers to ensure their success and growth on GKE/Google Cloud Platform (GCP).
Identify gaps and drive improvements across the entire GKE/Google Compute Engine (GCE) stack.
Help shape the team's culture so that it is a high-executing team that is fun to work with.
Set the technical direction and lead the goals for GKE AI/ML workload efficiency and optimization.
Qualifications
Minimum
Bachelor's degree or equivalent practical experience.
8 years of experience in software development.
5 years of experience in cloud computing and building operating systems.
Preferred
Master's degree or PhD in Computer Science, Machine Learning, or a related field.
5 years of experience with distributed systems, data analytics, and applied ML.
Experience with AI infrastructure (GPU/TPU, Networking, etc.) management and orchestration.
Experience with machine learning infrastructure, large-scale distributed systems, and Cloud.
Excellent problem-solving, code-tuning, and model-tuning skills.