About the job
As part of Optum AI, UnitedHealth Group's enterprise AI organization, you will build and scale production-grade machine learning and generative AI systems that directly impact patient outcomes, clinical efficiency, and enterprise automation. This team operates at the intersection of healthcare and cutting-edge AI, developing platforms and capabilities used across the enterprise.
Responsibilities
Design, build, and maintain end-to-end ML platforms and pipelines (training, validation, deployment, and monitoring)
Productionize ML models using batch and real-time inference architectures (APIs, streaming, event-driven systems)
Develop and manage ML lifecycle workflows using tools such as MLflow, Kubeflow, SageMaker, or Azure ML
Build and maintain CI/CD pipelines for ML (CI/CT/CD), including automated testing, validation, and model promotion
Containerize and deploy ML workloads using Docker and Kubernetes, ensuring scalability and reliability
Implement infrastructure-as-code (Terraform or equivalent) for reproducible and secure ML environments
Develop monitoring and observability solutions for model performance, drift, latency, and data quality
Automate retraining and redeployment workflows based on performance degradation or new data availability
Partner with cross-functional teams to define and enforce ML engineering standards and best practices
Ensure compliance with enterprise governance, security, and Responsible AI requirements
Qualifications
Minimum
Bachelor's degree in Computer Science, Engineering, or a related field, OR 4+ years of equivalent experience
5+ years of experience in ML Engineering / MLOps with production deployment of machine learning systems
3+ years of experience with ML lifecycle tools (MLflow, Kubeflow, SageMaker, Azure ML, or similar)
3+ years of experience with Docker and Kubernetes in production environments
3+ years of experience building CI/CD pipelines for ML using Git-based workflows and automation tools
2+ years of experience with cloud platforms (AWS, Azure, or GCP) for ML workloads
Experience with real-time and batch inference systems built on streaming platforms (e.g., Kafka, Kinesis, Event Hubs)
5+ years of programming experience in Python with ML frameworks (PyTorch, TensorFlow, or scikit-learn)
Preferred
7+ years of experience in ML engineering or distributed systems
Experience with feature stores (e.g., Feast) and data versioning systems
Hands-on experience with distributed data processing frameworks (Spark, Ray)
Experience with workflow orchestration tools (Airflow, Dagster, Prefect)
Experience with multi-cloud or hybrid cloud ML deployments
Knowledge of Responsible AI, bias detection, and model explainability techniques
Familiarity with observability tools (Prometheus, Grafana, OpenTelemetry)
Proven contributions to open-source ML or MLOps projects