AI/ML Infrastructure Software Development Engineer

About the job

To achieve an organization’s mission, leaders need strong team members who can create and analyze processes, communicate requirements, and develop innovative solutions throughout the execution of the mission. Whether reviewing program-wide technical architecture or providing AI/ML infrastructure expertise, our clients need someone who combines deep technical understanding of software engineering with strong architectural judgment. That is why we need you, an experienced AI/ML Software Development Engineer who can operate at a system-of-systems level to support clients in advancing AI-enabled systems within an R&D environment.

Responsibilities

Own and operate all backend and infrastructure components for an AI/ML model on Azure, including compute, APIs, identity, data layers, and IaC-driven environments

Build and maintain resilient CI/CD, deployment automation, secrets management, and production-grade fundamentals, including monitoring, alerting, logging, tracing, SLOs, and incident response

Manage cost and token economics across all LLM providers, including budgets, guardrails, and cost-per-query optimizations

Lead agentic and protocol infrastructure, including MCP backend implementation, tool-calling systems, and reliable A2A communication patterns

Design and evolve LLM orchestration, multi-model routing, and robust fallback and degradation patterns across GPT, Claude, and Gemini

Build and operate RAG and knowledge pipelines, including ingestion, indexing, embedding, semantic

Qualifications

Minimum

7+ years of experience with software engineering, including building and operating production systems

Experience being on-call, debugging incidents, and writing postmortems

Experience in high-velocity environments where you owned and shipped complex products end-to-end

Experience with at least 2 backend languages, including Python

Experience with Microsoft Azure, including Azure Functions, API Management, Container Apps, and Azure OpenAI Service

Experience with containerization, CI/CD, and Infrastructure as Code (IaC)

Knowledge of modern backend frameworks, async patterns, distributed systems, APIs, data pipelines, and software design patterns

Knowledge of authentication and identity systems, such as OAuth2, OIDC, or Microsoft Entra ID

Ability to own production systems

Bachelor's degree in Computer Science or Software Engineering

Preferred

Experience in healthcare, life sciences, or other regulated domains

Experience in security-conscious engineering, including input validation, output sanitization, audit logging, and responsible AI guardrails

Experience in startup or early-stage environments, such as 0-to-1 product building

Experience implementing A2A communication patterns and multi-agent orchestration frameworks

Experience building on top of LLMs in production, including tool-calling, RAG, multi-step reasoning, multi-model routing, and context window management

Experience managing multi-provider LLM integrations, including rate limits, fallback routing, and API versioning

Experience in security-conscious engineering in regulated or government environments

Ability to work as a self-starter in a fast-paced environment

Ability to work comfortably with ambiguity and with a high sense of urgency

Master’s degree in a relevant field