About the job
We’re seeking an experienced AI Platforms Leader to own the strategy, architecture, and operation of our end-to-end AI Platform—spanning on-prem GPU clusters and cloud services (AWS/GCP/Azure). You’ll lead a high-caliber engineering team to deliver reliable, secure, and cost-efficient infrastructure for training, fine-tuning, inference, retrieval, and agentic orchestration (including A2A patterns and MCP servers). If you love turning complex AI/ML requirements into robust, self-service platform capabilities for builders across the company, this is your role. This role requires full-time onsite work in San Diego, CA (5 days per week).
Responsibilities
Own the AI Platform strategy & roadmap
Define the multi-year vision for a multi-tenant, hybrid (on-prem + cloud) AI platform, aligned to business needs, developer productivity, and cost efficiency.
Establish clear platform SLAs/SLOs, reliability goals, and security/compliance guardrails.
Run GPU-based compute at scale
Operate and optimize on-prem GPU clusters (e.g., Kubernetes + GPU operator and/or Slurm), including capacity planning, scheduling, partitioning, NCCL, and high-throughput storage/networking.
Drive GPU utilization efficiency, right-sizing, and cost transparency across training and inference workloads.
Deliver MLOps & LLMOps as a product
Provide golden paths for data prep, training/fine-tuning, model registry, lineage, governance, evaluation, red-teaming, and safe deployment (batch, online, streaming).
Implement CI/CD for models, prompts, and agents; automate evaluations and rollout/rollback using canary, A/B, and shadow deployments.
Agentic AI, A2A, and MCP ecosystem
Lead the design and operation of agentic orchestration (A2A patterns), tool integration, and MCP (Model Context Protocol) servers to securely expose enterprise tools and data.
Standardize agent capability schemas, guardrails, observability, and policy enforcement.
Cloud AI/ML platforms
Leverage AWS/Azure AI services for training and inference (e.g., Bedrock/SageMaker/EKS; Azure AI Studio/Azure ML/AKS/Azure OpenAI) with robust networking, identity, secrets, and cost controls.
Establish multi-cloud patterns for portability, resilience, and vendor risk management.
Platform engineering & DevOps excellence
Own core platform services: identity/RBAC, secrets, service meshes, observability (logs/metrics/traces), data access controls, vector stores, feature stores, and model serving and gateways (e.g., KServe/Triton/vLLM).
Use GitOps/IaC (Terraform/Bicep/Helm) and secure software supply chain practices (SBOMs, image signing, policy as code).
Operational leadership
Lead a global team of ~10 engineers (platform, SRE, MLOps/LLMOps) with 24×7 readiness and a healthy on-call rotation.
Drive incident response, post-mortems, and continuous improvement. Partner with Security, Legal, and Compliance for model/data governance.
Stakeholder & vendor management
Partner with product, data, and application teams to enable high-impact AI use cases.
Manage strategic vendors (e.g., cloud, GPU, enterprise AI tooling) and negotiate licenses/SOWs aligned to roadmap and budget.
Qualifications
Minimum
• Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 8+ years of Software Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 7+ years of Software Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field and 6+ years of Software Engineering or related work experience.
• 4+ years of work experience with a programming language such as C, C++, Java, or Python.
Preferred
Master’s or PhD in CS/EE/Math or related field.
Experience with:
Training & inference stacks: PyTorch, CUDA/cuDNN, Triton Inference Server, vLLM, KServe, Ray, Slurm.
Data & storage: High-throughput storage (e.g., Lustre, BeeGFS, Ceph), vector stores and search (e.g., FAISS, Milvus, Pinecone, Azure AI Search), feature stores (e.g., Feast).
MLOps toolchain: MLflow/Vertex AI/Azure ML/SageMaker registries, Airflow/Argo, Weights & Biases, LangSmith, prompt/version management.
Security & governance: OIDC/RBAC, policy as code (OPA), secrets management (AWS Secrets Manager/Azure Key Vault), model governance/risk controls, privacy/PII safeguards.
Agentic frameworks: Semantic Kernel, LangChain, CrewAI, AutoGen (or equivalents) and experience integrating enterprise tools via MCP.
Proven track record shipping platform capabilities that enable multiple product teams (self-service, docs, SDKs, templates, golden paths).
Strong communication with executives and technical leaders; clear metrics, dashboards, and business value storytelling.