About the job
Our AI engineers make a real impact on the safety and ROI of large language models and agentic applications across different verticals and domains. You will work at the cutting edge, envisioning and building new tools and algorithms to monitor, explain, and improve these applications and, in turn, empower our customers.
Responsibilities
Design and build core services and components of a world-class cloud platform that helps enterprises develop, monitor, and improve their full suite of AI-based applications (covering predictive models, LLMs, GenAI models, and agentic applications)
Lead the design and implementation of distributed systems, microservices and applications that compute, persist, and expose new ML + agentic observability metrics (e.g., response relevancy, hallucination scores) from raw trace data
Spearhead the development of new types of metrics and evaluation capabilities to satisfy evolving customer needs around agentic applications. Take part in conversations with customers around discovery and support
Develop in-house AI agents and GenAI capabilities to augment the Fiddler observability products
Define and evolve the operational maturity (reliability, observability, SLOs) of core services and components; establish best practices and champion improvements across the team
Team & Culture Building: you will take an active role in building a world-class engineering team and participate in the talent acquisition process through interviewing, candidate evaluation, and coaching
Qualifications
Minimum
Master's or Bachelor's degree in Computer Science or a related field, combined with 7+ years of industry experience and a demonstrated solid foundation in software development
Deep proficiency with Python
Prior experience building and operating highly complex SaaS platforms and systems
Ability to work closely with product, design, and customer engineering teams to improve and expand our product offerings
Preferred
Practical working knowledge of machine learning and data science
Experience with OLAP systems, Postgres, Redis, Kafka, RabbitMQ, Ray and Spark
Familiarity with frameworks like LangChain, LangGraph, Google ADK, Amazon Strands, OpenAI, etc.
Experience building and integrating Model Context Protocol (MCP) servers
Hands-on experience designing, implementing, or operating AI/ML observability or evaluation systems
Coaching & Mentorship: serve as a strong collaborator and mentor to other team members, raising the technical bar for the entire team and regularly engaging in code and design reviews
Ability to work in our Palo Alto office 3 days a week