Technical Program Manager - AI Cluster Engineering

About the job

We are seeking an experienced Technical Program Manager to drive end-to-end execution of AI cluster engineering programs spanning GPU platforms, rack-scale solutions, high-speed networking, and datacenter AI infrastructure. You will work cross-functionally to translate customer and internal requirements into executable plans, manage risks and dependencies, and deliver scalable, production-ready solutions across GPU → rack → cluster deployments.

Responsibilities

Define and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment readiness. Create and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports. Identify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas.

Own program execution for rack- and cluster-network enablement, including topology decisions, switching/optics/cabling readiness, and validation schedules for scale-out operation. Drive alignment on advanced AI networking requirements, such as network architecture and reliability impacts that require mitigation. Partner with internal/external stakeholders to track and close network blockers.

Lead cross-functional delivery for rack solutions that integrate CPU + GPU + NICs, ensuring end-to-end readiness across hardware, firmware, and management interfaces. Drive requirements capture and execution planning for rack-scale deployments (rack density, rack form factor, power targets, whips, liquid cooling, etc.) and ensure integration plans are validated with engineering and operations.

Own program coordination for pod/rack manageability solutions, aligning requirements and milestones for inventory, health monitoring, cluster provisioning, and observability across large-scale deployments. Coordinate with platform/automation teams on cluster provisioning and orchestration.

Drive readiness for rack-level automation and regression workflows (scripts, log mapping, infrastructure automation planning), planning execution to de-risk hardware arrival timing. Partner with CI/CD and FW automation stakeholders to align on deliverables and validation gates.

Qualifications

Minimum

Proven program management experience delivering complex, cross-functional hardware/software infrastructure programs (server/rack/cluster environments). Strong understanding of datacenter building blocks and lifecycle: servers, racks, clusters, HW/FW/SW integration, and readiness/validation flows. Demonstrated ability to build and run schedules, manage risks, lead matrix teams, and communicate clearly to engineering and executive audiences. Strong working knowledge of program tools (e.g., Jira/Confluence/Microsoft Office) and dashboard-based execution management.

Preferred

AI cluster networking domain experience (NICs, switching/optics/cabling, topologies). Familiarity with rack/pod management and operations concerns (telemetry, health monitoring, power control, FW provisioning, management networks). Experience leading server integration programs at OEMs, ODMs, or data centers. Demonstrated horizontal leadership across large matrix organizations. Formal PM education/certification (PMP / Scrum Master).