About the job
Join Amazon's Fulfillment Technologies & Robotics (FTR) team to spearhead the product vision for a platform that ensures Amazon's fulfillment network never stops — even as we move toward fully self-governing, zero-touch operations. You'll own the roadmap for an AI-powered infrastructure reliability platform that prevents, detects, and resolves incidents across thousands of fulfillment sites globally.
Responsibilities
Own and drive the multi-year product roadmap for the Infrastructure Reliability AI-Ops platform, spanning three strategic programs: zero-touch incident resolution, associate-directed work tooling, and predictive failure prevention. This means defining the vision, strategy, and success metrics for AI-powered progressive detection, incident consolidation, self-governing remediation orchestration, and cross-domain observability capabilities that serve thousands of fulfillment sites globally.
Go beyond traditional product management by writing code and delivering working proof-of-concepts that validate technical hypotheses before committing engineering resources. Whether prototyping a multi-agent reasoning pipeline, exploring a new anomaly detection approach, or stress-testing an LLM prompt chain against real incident data, you will use your technical skills to compress the distance between idea and validated direction.
Bring deep knowledge of machine learning fundamentals and apply it to shape how the platform detects, consolidates, and reasons about failures. You will engage meaningfully with data scientists on model architecture choices, feature engineering tradeoffs, and evaluation frameworks, understanding not just what a model produces but why, and whether that reasoning can be trusted in a production environment where self-governing remediation decisions carry real operational risk.
Apply your understanding of AI reasoning techniques (including chain-of-thought prompting, retrieval-augmented generation, confidence calibration, and evidence accumulation) to define how the platform builds progressive confidence about incident severity and failure origin rather than making binary decisions from rigid thresholds. You will shape how LLMs are applied to diagnostic summarization, resolution suggestion, and automated stakeholder communication.
Define the multi-agent architecture that orchestrates detection, investigation, consolidation, diagnosis, and remediation as a coordinated system rather than isolated capabilities. You will work with engineering to define agent roles, communication protocols, handoff conditions, and safety boundaries, ensuring that self-governing agents act only with sufficient confidence and escalate when uncertainty is high.
Translate complex operational and technical requirements into a prioritized backlog, making clear tradeoffs between feature depth, platform scalability, and autonomous site readiness milestones. You will serve as the voice of Incident Managers, domain engineers, and Operations Control Center stakeholders, deeply understanding their daily workflows and advocating for their needs during executive-level planning and prioritization.
Define and track the business case across all three programs — including mean time to resolve improvements, lost labor hour reduction, and first page resolution improvement — to secure continued investment. You will establish mechanisms to measure platform performance against key metrics including auto-detection rate, false positive rate, consolidation accuracy, and remediation success rate, iterating rapidly based on data.
Drive cross-functional alignment across Fulfillment Technologies, Robotics, Network Engineering, Application teams, and Operations to ensure the platform's cross-domain orchestration model is well understood and adopted. You will lead executive-level reviews of program progress, risks, and investment cases, communicating clearly about the path from near-term detection improvements to longer-term autonomous site readiness.
Qualifications
Minimum
Master's degree
Experience owning/driving roadmap strategy and definition
Experience with feature delivery and tradeoffs of a product
Experience contributing to engineering discussions around technology decisions and strategy related to a product
Experience managing technical products or online services
Experience in representing and advocating for a variety of critical customers and stakeholders during executive-level prioritization and planning
Experience with technologies like Hadoop, rules-based systems, and machine learning
Preferred
Experience using analytical tools such as Tableau, QlikView, and QuickSight
Experience in building and driving adoption of new tools