Sr Machine Learning Engineer

Disney
Lake Buena Vista, FL, USA / USA - CA - 820 S Flower St, USA - FL - Kirkman Point 1, USA - WA - 925 4th Ave
2026-04-15
Full time

About the job

The Machine Learning / Software Engineer plays a critical role in designing, developing, and implementing self-healing infrastructure management systems for enterprise-wide production environments. This role combines deep expertise in machine learning, AI technology, software engineering, and DevOps to create reusable patterns, frameworks, and services that improve reliability across Services and Platforms. The candidate will serve as a thought leader, identifying opportunities for, and applying, advanced analytics, predictive modeling, and AI to large-scale telemetry, change, event, and incident data to derive actionable insights. The role focuses on building, deploying, and operating machine learning models that proactively detect issues, predict failures, and drive automated, self-healing remediation across enterprise systems.

Responsibilities

Work alongside our first-class applications, infrastructure & operations teams to understand current manual processes and business requirements

Architect, design, and implement reusable machine learning frameworks, patterns, and services that integrate into the enterprise automation and observability platforms

Design, train, and deploy machine learning models in distributed environments for anomaly detection, forecasting, predictive analytics, event correlation, pattern recognition, classification, causal analysis, and more, to surface leading indicators of failure

Build near-real-time inference pipelines that generate actionable insights from live telemetry, including continuous streams of metrics, logs, traces, and operational events

Create data abstractions and perform feature engineering on high-volume, high-cardinality telemetry data

Evaluate model performance using real production signals and continuously iterate to improve accuracy and reliability

Build closed-loop, event-driven systems where model signals trigger automated remediation actions

Partner with infrastructure and SRE teams to identify opportunities and integrate machine learning and AI-driven insights into operational tools, workflows, and dashboards

Analyze incident and historical data to uncover leading indicators and predictive signals

Own the full machine learning lifecycle: experimentation, validation, deployment, monitoring, and retraining

Break down targeted, manual processes into reusable software modules that leverage machine learning models

Build emulation and simulation environments (digital twins) of the infrastructure to test AI/ML-driven automation under realistic scenarios and allow for faster ideation and iteration for architects and engineers.

Develop algorithms and frameworks to integrate machine learning and AI technologies into our orchestration platform

Ensure service reliability, performance, and operational uptime through code-driven solutions.

Conduct root cause analysis, design fault-tolerant architectures, and enable self-healing automation.

Implement monitoring dashboards and KPIs to provide visibility into automation and tooling performance.

Collaborate with cross-functional teams including network engineers, software developers, machine learning engineers, and operations teams across the enterprise.

Support the integration of commercial and open-source tools while maintaining a vendor-agnostic implementation

Qualifications

Minimum

7+ years of software engineering experience, with expertise in automation, machine learning, and AI technologies

Proven hands-on experience building production-grade ML models and inference pipelines; strong proficiency with modern ML frameworks such as PyTorch, TensorFlow, Scikit-learn, etc.

Proven hands-on experience building frontend, API, and backend functionality; strong proficiency with Python, JavaScript, TypeScript, Go, or Rust

Strong hands-on experience building and deploying machine learning models on event-driven or streaming data in production

Solid foundation in statistics, data analysis, and applied machine learning techniques

Experience working with large-scale, real-world datasets (noisy, incomplete, non-standardized, and evolving)

Experience operationalizing models in distributed, production environments

Ability to translate ambiguous operational problems into solvable machine learning use cases

Experience with modern cloud platforms, container orchestration (Kubernetes/Docker), identity/auth frameworks, and data and workflow orchestration.

Experience with AI/ML technologies and data engineering concepts. Preferred: proven hands-on experience building AI agents.

Demonstrated success designing and building enterprise-scale systems and reusable software frameworks.

Strong communication, collaboration and leadership skills

Applies systems thinking to understand how individual components fit into larger, more holistic solutions.

Capable of quickly shifting between detailed, hands-on work and high-level strategic thinking.

Preferred

Certifications such as Kubernetes (CKA/CKAD), AWS/Azure/GCP, CCNP/DevNet, or NVIDIA AI engineer.

Experience developing low-code/no-code automation platforms or reusable developer toolkits.

Contributions to open-source automation, machine learning, AI, observability, or DevOps communities.

Applying unsupervised and semi-supervised learning for anomaly detection and signal discovery

Applying complex event processing and event correlation techniques

Building time-series forecasting models for capacity, latency, and failure prediction

Experience with feature stores, offline/online feature pipelines, and feature reuse

Implementing model monitoring for drift, bias, and performance degradation

Experience with reinforcement learning or decision models for automated remediation and optimization

Working with real-time or near-real-time inference pipelines

Experience labeling, curating, and managing training data derived from production telemetry

Experience mentoring engineers, sharing knowledge, and fostering a learning culture

Demonstrated curiosity and a continuous-learning mindset, with a passion for exploring emerging AI/ML, automation, and platform technologies