Staff Data Platform Engineer (Hybrid)

About the job

Our Staff Data Platform Engineers make a real impact on the safety and ROI of large language models and agentic applications across different verticals and domains. You will work at the cutting edge, envisioning and building new types of tools and algorithms to monitor, explain, and improve such applications and, in turn, empower our customers.

Responsibilities

Design and build core services and components of a world-class cloud platform to help enterprises develop, monitor, and improve their full suite of AI-based applications (covering predictive models, LLMs, GenAI models, and agentic applications)

Lead the design and implementation of distributed systems and microservices that compute, persist, and expose new ML + agentic observability metrics (e.g., response relevancy, hallucination scores) from raw trace data

Design enterprise-grade, scalable data infrastructure, services, and APIs to support enterprise-scale workloads and meet compliance needs and SLAs

Spearhead the development of new types of metrics and evaluation capabilities to satisfy evolving customer needs. Take part in discovery and support conversations with customers

Define and evolve the operational maturity (reliability, latency, SLOs, observability) of core services; establish best practices; and champion improvements to internal CI/CD processes, testing frameworks, error handling, efficiency, and resiliency

Team & Culture Building: take an active role in building a world-class engineering team and participate in the talent acquisition process through interviewing, candidate evaluation, and coaching

Qualifications

Minimum

Master's or Bachelor's degree in Computer Science or a related field, combined with 7+ years of industry experience and a demonstrated solid foundation in software development.

Deep proficiency with Python and a strong command of essential backend technologies such as Postgres, Redis, Kafka, RabbitMQ, and Ray. This includes the ability

Preferred

Experience with cloud infrastructure (AWS/GCP, Kubernetes) and specialized databases (ClickHouse/Druid), demonstrating a deeper understanding of system architecture and performance optimization.