About the job
To achieve an organization’s mission, leaders need strong team members who can create and analyze processes, communicate requirements, and develop innovative solutions throughout the execution of the mission. Whether reviewing program-wide technical architecture or providing AI/ML infrastructure expertise, our clients need someone who combines deep technical understanding of software engineering with strong architectural judgment. That is why we need you, an experienced AI/ML Software Development Engineer who can operate at a system-of-systems level to support clients in advancing AI-enabled systems within an R&D environment.
Responsibilities
Own and operate all backend and infrastructure components for an AI/ML model on Azure, including compute, APIs, identity, data layers, and IaC-driven environments
Build and maintain resilient CI/CD, deployment automation, secrets management, and production-grade fundamentals, including monitoring, alerting, logging, tracing, SLOs, and incident response
Manage cost and token economics across all LLM providers, including budgets, guardrails, and cost-per-query optimizations
Lead agentic and protocol infrastructure, including MCP backend implementation, tool-calling systems, and reliable A2A communication patterns
Design and evolve LLM orchestration, multi-model routing, and robust fallback and degradation patterns across GPT, Claude, and Gemini
Build and operate RAG and knowledge pipelines, including ingestion, indexing, embedding, semantic
Qualifications
Minimum
7+ years of experience with software engineering, including building and operating production systems
Experience being on-call, debugging incidents, and writing postmortems
Experience in high-velocity environments where you owned and shipped complex products end-to-end
Experience with at least 2 backend languages, including Python
Experience with Microsoft Azure, including Azure Functions, API Management, Container Apps, and Azure OpenAI Service
Experience with containerization, CI/CD, and Infrastructure as Code
Knowledge of modern backend frameworks, async patterns, distributed systems, APIs, data pipelines, and software design patterns
Knowledge of authentication and identity systems, such as OAuth2, OIDC, or Microsoft Entra ID
Ability to own production systems
Bachelor's degree in Computer Science or Software Engineering
Preferred
Experience in healthcare, life sciences, or other regulated domains
Experience in security-conscious engineering, including input validation, output sanitization, audit logging, and responsible AI guardrails
Experience in startup or early-stage environments, such as 0-to-1 product building
Experience implementing A2A communication patterns and multi-agent orchestration frameworks
Experience building on top of LLMs in production, including tool-calling, RAG, multi-step reasoning, multi-model routing, and context window management
Experience managing multi-provider LLM integrations, including rate limits, fallback routing, and API versioning
Experience in security-conscious engineering in regulated or government environments
Ability to be a self-starter and operate within a fast-paced environment
Ability to work comfortably with ambiguity and with a high sense of urgency
Master’s degree in a relevant field