Principal Machine Learning Engineer

About the job

We are seeking an exceptional Principal Machine Learning Engineer to join our organization at the forefront of applied AI. This is a senior individual contributor role designed for a practitioner who is equally at home architecting large-scale LLM infrastructure, building scalable Python backend APIs, and driving organization-wide AI transformation. You will design and deliver Generative AI and Agentic AI systems, set engineering standards for building production-grade ML applications, and mentor engineering teams across the organization. You will play a critical role in leading S&P’s AI-driven transformation to drive value internally and for our customers.

Responsibilities

LLM & Generative AI Engineering: Deploy and architect production-scale LLM systems spanning frontier models (GPT-4 class), open-source variants (such as LLaMA, Mistral, Gemma), RAG pipelines, and multi-modal AI systems incorporating text, code, images, and structured data.

Agentic AI Systems: Design and operationalize autonomous AI agents with multi-agent orchestration, tool-use capabilities, memory management, and enterprise-grade guardrails and observability strategies.

Python & Software Engineering: Write high-performance Python code following SOLID principles, lead code reviews, build reusable AI libraries, and implement rigorous testing and CI/CD practices across all ML workloads.

Cloud & Distributed Systems: Architect cloud-native AI infrastructure with GPU cluster management, auto-scaling inference endpoints, vector databases, and cost-optimized distributed systems for high-throughput model serving, leveraging managed AI services (such as Bedrock, Azure OpenAI, Vertex AI) alongside self-hosted deployments (such as vLLM, TGI).

Backend APIs & Systems Integration: Design production-grade RESTful and asynchronous APIs (using frameworks such as FastAPI, gRPC) exposing AI capabilities, integrate LLM services with enterprise systems, and own end-to-end performance, reliability, and security from design through production.

MLOps & LLMOps: Implement comprehensive ML pipelines covering training through monitoring with tools such as MLflow, Kubeflow, and SageMaker, establish prompt versioning and model governance practices, and instrument production systems with observability across performance and quality metrics.

DevOps & Platform Engineering: Embed AI workloads into CI/CD pipelines, champion containerization (such as Docker, Kubernetes, Helm) and GitOps workflows, define SRE practices for ML systems, and drive platform standardization for self-service AI capabilities.

Organization-Wide AI Transformation: Advise engineering, product and business leadership on AI strategy and build-vs-buy decisions, evaluate third-party tooling, define transformation KPIs, and partner with governance teams to establish responsible AI policies and regulatory frameworks.

Qualifications

Minimum

10+ years of progressive experience, with 8+ years in data science, data analytics, machine learning engineering, or similar roles.

Proven ability to translate complex technical concepts for non-technical audiences with clarity and impact.

Experience defining technical roadmaps, architecture decision records (ADRs), and engineering standards adopted across multiple teams.

History of mentoring senior and mid-level engineers, conducting effective technical interviews, and raising the organizational engineering bar.

LLM Frameworks: Extensive knowledge of and experience with tools such as LangChain, LlamaIndex, LangGraph, Hugging Face Transformers, PEFT, vLLM, Ollama, or equivalent production-grade tooling.

MLOps Tooling: Extensive knowledge of and experience with tools such as MLflow, SageMaker, Vertex AI, or Kubeflow, with a bias toward automation and repeatability.

Cloud Platforms: Deep expertise in cloud platforms such as AWS, GCP, or Azure.

Python: Expert-level proficiency including async programming, performance optimization, type systems, packaging, and internal library authorship.

Databases & Storage: Vector databases (such as Pinecone, OpenSearch, Chroma), relational (such as PostgreSQL), NoSQL (such as Redis, DynamoDB), and object storage.

Containerization & Orchestration: Expertise with tools such as Docker, Kubernetes, and Helm.

Backend Development: Expertise in frameworks such as FastAPI, REST design principles, async patterns, OAuth2/JWT, and API security best practices.

Distributed Systems: Experience with message queues (such as Kafka, SQS), event streaming, and microservices design patterns.

Preferred

MS in Computer Science, Machine Learning, Engineering, or a related quantitative field.

Published open-source contributions in the LLM, GenAI, or NLP space.

Experience operating in regulated industries (finance, healthcare, legal) with AI compliance, auditability, and risk management requirements.

Contributions to enterprise AI governance frameworks, model risk management programs, or responsible AI practices development.

Cloud AI certifications: AWS ML Specialty, GCP Professional ML Engineer, Azure AI Engineer Associate, or equivalent.