About the job
The Agent Evaluation team is responsible for testing whether AI agents return the correct and expected responses. We build the framework, metrics, and test cases that validate agent behavior, accuracy, and reliability before release. Our goal is to ensure agents perform consistently and meet product and user expectations.
Responsibilities
Design and develop agent evaluation pipelines across development, staging, and production environments
Define and standardize evaluation metrics and benchmarks for conversational AI quality (accuracy, relevance, customer experience, safety)
Build automated and human-in-the-loop evaluation systems to assess agent performance
Manage and curate evaluation datasets, test sets, and annotation workflows
Enable continuous evaluation and monitoring of agents in production
Integrate evaluation into CI/CD pipelines to support safe and efficient releases
Conduct experiments, A/B testing, and case studies to drive improvements in agent quality
Partner with engineering and product teams to deliver high-quality AI solutions
Create technical documentation and drive best practices across teams
Mentor junior engineers and contribute to team growth
Qualifications
Minimum
No minimum qualifications listed.
Preferred
Experience in customer support AI or chatbot platforms
Understanding of responsible AI (bias, fairness, hallucination mitigation)