About the job
Scale GP is building the infrastructure that makes enterprise AI seamless. We are looking for a Senior or Staff Infrastructure Engineer to act as a primary technical lead, engineering the 'paved road' for our knowledge retrieval and inference engines. You won't just be managing resources; you’ll be defining the deployment standards for Agentic workflows at scale. Your mission is to bridge the gap between complex AI orchestration and world-class infrastructure, ensuring our platform remains the most reliable destination for enterprise agents.
Responsibilities
Architect multi-cloud systems and abstractions that allow the SGP platform to run on top of existing cloud providers.
Use our own data and AI platform to analyze build and test logs and metrics to identify areas for improvement.
Define the architectural patterns for our multi-cloud infrastructure to support secure, reliable, and scalable Agentic workflows for enterprise customers.
Enhance engineering and infrastructure efficiency, reliability, accuracy, and response times, including CI/CD processes, test frameworks, data quality assurance, end-to-end reconciliation, and anomaly detection.
Collaborate with platform and product teams to develop and implement innovative infrastructure that scales to meet evolving needs.
Design and champion highly scalable, reliable, and low-latency infrastructure and frameworks for building, orchestrating, and evaluating multi-agent systems at enterprise scale.
Lead the infrastructure roadmap with a strong focus on compliance, privacy, and security standards, including designing change management and data isolation strategies.
Own the development and maintenance of our best-in-class Agentic observability platform (logging, metrics, tracing, and analytics) to proactively ensure system health and enable rapid incident response.
Drive developer efficiency by building automated tooling and championing Infrastructure-as-Code (IaC) paradigms throughout the engineering organization.
Qualifications
Minimum
Proven experience in a senior role, with 5+ years of full-time software engineering experience.
Deep understanding of modern infrastructure practices, including CI/CD, IaC (e.g., Terraform, Helm Charts), container orchestration (e.g., Kubernetes), and observability platforms (e.g., Datadog, Prometheus, Grafana).
Extensive experience with at least one major cloud provider (AWS, Azure, or GCP).
Strong knowledge of security and compliance in enterprise environments, with a focus on access management, data isolation, and customer-specific VPC setups.
Proficiency in Python or JavaScript/TypeScript, and SQL.
Preferred
Hands-on experience and a passion for working with Agents, LLMs, vector databases, and other emerging AI technologies.