Agent Evaluation Engineer

About the job

The Agent Evaluation team is responsible for testing whether AI agents return the correct and expected responses. We build the framework, metrics, and test cases that validate agent behavior, accuracy, and reliability before release. Our goal is to ensure agents perform consistently and meet product and user expectations.

Responsibilities

Design and develop agent evaluation pipelines across development, staging, and production environments

Define and standardize evaluation metrics and benchmarks for conversational AI quality (accuracy, relevance, customer experience (CX), safety)

Build automated and human-in-the-loop evaluation systems to assess agent performance

Manage and curate evaluation datasets, test sets, and annotation workflows

Enable continuous evaluation and monitoring of agents in production

Integrate evaluation into CI/CD pipelines to support safe and efficient releases

Conduct experiments, A/B testing, and case studies to drive improvements in agent quality

Partner with engineering and product teams to deliver high-quality AI solutions

Create technical documentation and drive best practices across teams

Mentor junior engineers and contribute to team growth

Qualifications

Minimum

No minimum qualifications listed.

Preferred

Experience in customer support AI or chatbot platforms

Understanding of responsible AI (bias, fairness, hallucination mitigation)