About the job
AI Cloud Platform Site Reliability Engineer
The Opportunity:
Mission users are increasingly relying on agentic AI systems to support complex workflows, accelerate analysis, and improve decision advantage. Unlike traditional software systems, agentic AI platforms introduce operational complexity across model invocations, workflow orchestration, tool integrations, retrieval and knowledge layers, safety controls, and probabilistic outputs. As an AI Platform Site Reliability Engineer (SRE), you’ll help ensure the availability, resiliency, observability, and operational integrity of an AWS GovCloud-based agentic AI platform supporting national defense missions.
In this role, you’ll serve as the reliability owner for production AI operations. You’ll work cross-functionally with multiple stakeholders, including cloud engineering, platform engineering, AI agent development, MLOps, data science, and customer knowledge teams, to operationalize their work in production through monitoring, alerting, Service Level Indicator (SLI) and Service Level Objective (SLO) management, incident response, ticket triage, change control, and automation. You won’t be duplicating model development, data science, or cloud platform build responsibilities. Instead, you’ll ensure that the system, its agents, and their supporting services remain healthy, traceable, performant, and supportable in mission environments.
Responsibilities
Define, implement, and maintain service level indicators, service level objectives, error budgets, dashboards, alarms, and escalation paths for an agentic AI platform operating in AWS GovCloud.
Monitor end-to-end health and performance of agent workflows, model invocations, retrieval or knowledge integrations, orchestration steps, tool calls, and dependent services.
Triage incidents, alerts, and operational tickets. Lead root-cause analysis, coordinate recovery actions, and drive post-incident corrective actions that reduce mean time to recovery and prevent recurrence.
Build and maintain observability pipelines across metrics, logs, traces, audit telemetry, and operational events using AWS-native tooling and approved enterprise observability tooling.
Establish and tune operational thresholds for latency, availability, error rates, token and cost consumption, workflow success rates, tool failure rates, guardrail interventions, and drift-related signals.
Partner with platform engineers, cloud engineers, AI agent developers, MLOps engineers, data scientists, and customer SMEs to define ownership boundaries, handoffs, rollback criteria, release readiness gates, and operational support models.
Coordinate with MLOps and data science teams when model or data quality degradation, drift, or unexpected behavior requires rollback, retraining, prompt changes, knowledge-base updates, or other corrective actions.
Automate remediation and routine operational tasks using Python, shell scripting, infrastructure as code, and event-driven workflows to reduce manual toil.
Support secure and compliant operations in regulated national defense environments, including auditability, least-privilege access, controlled logging, and disciplined change management.
Work with limited direction, mentor junior team members, and help mature AI operations practices across the program.
Qualifications
Minimum
5+ years of experience supporting production distributed systems in roles such as SRE, Platform Engineering, Cloud Operations, or DevOps
Experience operating workloads on AWS including monitoring, alerting, logging, incident response, troubleshooting, IAM, networking, or secure operations
Experience supporting production AI/ML, generative AI, RAG, agentic AI, model-serving, or data-driven decision systems
Experience defining and operating SLIs, SLOs, error budgets, alert thresholds, runbooks, or operational readiness criteria
Experience with observability tooling across metrics, logs, traces, dashboards, or log analytics, including CloudWatch, OpenTelemetry, Prometheus, Grafana, OpenSearch, or ELK
Experience diagnosing issues across containers, orchestration platforms, or cloud runtimes, such as EKS, ECS, Lambda, or EC2
Experience with Python, Bash, or scripting languages to automate operational tasks, health checks, or remediation workflows
Experience participating in on-call rotations, triaging ticket queues, and leading incident response or post-incident review activities
Secret clearance
Bachelor’s degree
Preferred
Experience with Amazon Bedrock, Bedrock Agents, Guardrails, Knowledge Bases, model invocation logging, EventBridge, CloudTrail, and CloudWatch-based monitoring for AI workloads or equivalent tooling for production agentic AI systems
Experience supporting AWS workloads in GovCloud, FedRAMP High, DoD SRG IL4/5, or other regulated or high-assurance environments
Experience with automation and infrastructure as code using Terraform, CloudFormation, or AWS CDK
Experience with CI/CD release engineering, canary strategies, rollback controls, and change management for cloud services and AI-enabled applications
Experience with Prometheus-compatible monitoring, Grafana, OpenSearch/ELK, or other enterprise observability stacks in containerized environments
Experience supporting GPU-backed inference, self-hosted model serving, or hybrid AI deployments if the platform evolves beyond managed services
Ability to distinguish infrastructure issues from AI-specific failure modes including workflow breakdowns, degraded retrieval, safety interventions, regressions, stale knowledge sources, and model or service throttling
Experience working in Agile and cross-functional environments and collaborating with engineers, operators, mission stakeholders, and technical leadership
AWS Certified CloudOps Engineer - Associate, AWS Certified DevOps Engineer - Professional, AWS Certified Machine Learning Engineer - Associate, AWS Certified Generative AI Developer - Professional, AWS Certified Security - Specialty, or other cloud and AI operations certifications
CompTIA Security+ or DoD 8570/8140 baseline Certification