System Development Manager, Cloud Compute/GPU/Storage Server Team

About the job

We are looking for a forward-thinking technical leader to manage a diverse, cross-functional team of Hardware Design Engineers, Systems Development Engineers, and Technical Program Managers responsible for developing storage or accelerated (AI/ML/GPU) server platforms for AWS.

Responsibilities

Set the technical vision and multi-generational roadmap for storage or accelerator (AI/ML/GPU-based) server platforms

Make architectural bets that differentiate AWS — anticipating customer needs and industry shifts before they become obvious

Lead a team of hardware architects in defining server platform architectures that optimize for performance, reliability, cost, and speed of customer adoption

Translate deep understanding of customer workloads (storage, AI/ML training, inference) into hardware design decisions

Influence the broader AWS hardware strategy through data, conviction, and results

Own server platform development from architecture through detailed design, prototype, build, and qualification

Manage a team of engineers responsible for the design, build, and launch of systems

Lead ODM/JDM and design partner relationships, ensuring our requirements for performance, quality, testability, and diagnostics are met

Drive design verification, system validation, and qualification — ensuring platforms meet reliability, performance, and cost targets before deployment

Ensure systems are designed for operational excellence from day one — testability, diagnosability, and serviceability are built in, not bolted on

Own deployment to the data center, launch readiness, and successful ramp into production

Drive qualification and readiness milestones, removing technical and organizational blockers to get servers into the fleet

Own fleet health beyond launch: responsibility does not end at deployment. Monitor quality, reliability, and customer experience for the life of the platform

Drive toward zero-touch operations — building automation infrastructure that detects, diagnoses, and remediates faults before customer impact

Build predictive failure detection capabilities using telemetry, error trending, and log correlation

Establish and track fleet health metrics (failure rates, MTTD, MTTR, first-time fix rate, predictive accuracy)

Close the loop between field failures and design improvements in next-generation platforms

Manage and grow a diverse team spanning hardware engineering, systems development, and technical program management

Hire, develop, and retain top talent across multiple engineering disciplines

Create an environment where engineers with fundamentally different expertise (hardware, firmware, software, program management) collaborate effectively and challenge each other

Set clear goals, remove obstacles, and hold the team to high standards on delivery and quality

Coach and develop senior technical leaders — help architects think bigger and help execution-focused engineers see the strategic picture

Partner with AWS service teams to ensure server platforms meet data path and control path requirements and drive fast adoption

Work with supply chain, manufacturing, and datacenter operations teams to deliver at scale

Influence peer teams and senior leadership on technical direction, investment priorities, and trade-offs
