About the job
As a Principal Machine Learning Engineer within Reliability, you will set the 3-5 year technical strategy and architectural blueprint for how machine learning systems and practices can improve the reliability of the overall Roblox platform. You will own the architectural and execution roadmap for leveraging massive data across logs, traces, metrics, and production changes to proactively detect issues before they become real problems (reducing MTTD) and/or to reduce the time to resolve incidents (reducing MTTR).
Responsibilities
Define the strategy for leveraging Machine Learning Engineering to improve Production Systems Reliability at Roblox.
Improve real-time anomaly detection capabilities by applying state-of-the-art ML techniques, directly contributing to a lower Mean Time to Detect (MTTD) for production issues.
Develop methods for building pipelines that consume various streams of data (metrics, logs, traces, change management systems, etc.).
Build a reasoning layer that interacts with the streams of data to find possible root causes of problems happening in production.
Build time-series models that predict capacity exhaustion and seasonal traffic spikes in order to drive automated scaling.
Qualifications
Minimum
Beyond off-the-shelf: We are looking for an expert who has knowledge of various modeling techniques and the ability to go deep and fine-tune models to fit our use cases.
Ability to propose and architect the infrastructure that allows us to implement systems that learn from user and/or automated feedback.
Good distributed systems fundamentals and an understanding of large-scale, high-throughput systems.
Preferred
No preferred qualifications listed.