Site Reliability Engineer (SRE) - AI Platform & Cloud

About the job

We are seeking an experienced and driven Site Reliability Engineer (SRE) to join our AI Platform team and help support, scale, and harden the infrastructure that powers our AI/ML systems. You will collaborate closely with infrastructure engineering, cloud engineering, data engineering, and security teams to ensure the availability, reliability, performance, and security of production AI workloads (training, inference, data pipelines) in a regulated, high-stakes financial environment.

Responsibilities

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

Design and build automation for core platform capabilities, reducing manual toil

Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards

Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

Optimize cost vs. performance tradeoffs in large-scale compute environments

Harden systems for security, compliance, auditability, and data governance

Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

Maintain runbooks, operational playbooks, documentation, and training materials

Participate in on-call rotations and respond to production incidents 24/7 as needed

Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Qualifications

Minimum

Bachelor’s or Master’s degree in Computer Science or related field, or equivalent job experience

5 years of production experience in SRE, infrastructure, or operations roles for large-scale systems

Strong programming/scripting skills (Python, Go, Java, or equivalent)

Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)

Experience with infrastructure-as-code tools (Terraform, Helm, CloudFormation, Ansible, etc.)

Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

Solid experience in capacity planning, performance tuning, scaling, and incident response

Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

Excellent communication, documentation, and cross-team collaboration skills

Proven track record of reducing operational toil via automation

Preferred

Understanding of SRE techniques (SLOs, error budgets, toil reduction)

Proficiency with OpenTelemetry and observability tools such as Grafana, Loki, Prometheus, and Cortex

Good knowledge of microservice-based architectures and industry standards for both public and private cloud

Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)

Good knowledge of various data stores and platforms (SQL databases, Redis, Kafka, Snowflake, etc.) used for cloud application storage

Experience with generative AI development, including embeddings and fine-tuning of generative AI models

Experience in high-performance computing (HPC) and distributed GPU cluster scheduling (e.g., Slurm, Kubernetes GPU scheduling)

Understanding of ModelOps / MLOps / LLMOps

Experience with chaos engineering, canary deployments, and blue/green rollouts