Engineer, AI/ML Operations

Comcast
PA - Philadelphia, 1800 Arch St
2026-04-22
Full time

About the job

Xfinity Data Platform (XDP) is an intelligence warehousing platform that provides centralized information and control capabilities over subscribers' local area networks. It also acquires, stores, and aggregates data related to end-user devices connected to Comcast infrastructure. In addition, modular intelligence logic allows granular visibility into and control of subscribers' networks, providing device-centric value-added capabilities: creating and managing content access control policies, creating and managing performance policies, and issuing presence notifications to the Comcast data ecosystem.

Responsibilities

Maintaining and evolving the XDP platform, which underpins key services across Connected Living.

Supporting new infrastructure and operational requirements for Archetype, MLOps, and AIOps initiatives.

Ensuring the platform remains reliable, secure, and scalable during a period of rapid expansion.

Approximately 90% of our services run on Amazon EKS, and this engineer provides deep specialization in:
- Kubernetes operations and performance tuning
- Cluster stability and security
- Autoscaling, networking, and observability for mission-critical components

This role calls for specialized skill in running machine learning systems in production, including:
- Managing full ML lifecycle deployment pipelines
- Monitoring, retraining, and model governance
- Operationalizing AIOps patterns that reduce toil and elevate automation maturity

The engineer is responsible for operating and troubleshooting critical distributed systems, including:
- EKS, Spark, Kafka, Redis, and ZooKeeper
- Multi-layered data and compute pipelines that support real-time and batch workloads
- High-availability and disaster-recovery architecture

We rely on this role to:
- Build high-efficiency automation frameworks
- Reduce manual operational work using Python or Golang
- Accelerate engineering velocity with AI-assisted tooling such as GitHub Copilot

This position manages Dev, Staging, and Production workloads across multiple regions worldwide. This includes:
- Maintaining environment parity
- Handling compliance and regional configuration differences
- Supporting 24/7 global infrastructure

Historically, engineers in this role have delivered measurable value by:
- Reducing compute and storage costs
- Eliminating waste in Kubernetes clusters
- Optimizing data pipelines and observability tooling

Qualifications

Minimum

No minimum qualifications listed.

Preferred

No preferred qualifications listed.