About the job
In this role, you will work cross-functionally with Researchers and Engineers on Agent Evals and Quality to ensure that we have the best quality of agents for GenAI model improvement and product development. You will collaborate with agent platform and model teams to leverage user signals and metrics to improve model performance, focusing on the development, evaluation, and optimization of AI agentic systems, specifically LLM-based agents designed to perform complex, multi-step tasks and workflows.
Responsibilities
Construct quantitative benchmarks and automated evaluation frameworks (including LLM-as-a-judge) to measure agent capabilities in reasoning, planning, and tool use.
Create and optimize data mixes extracted from user feedback for training and fine-tuning agents to enhance performance on specific tool-use tasks.
Analyze agent behavior to identify failure modes, edge cases, and performance bottlenecks, turning these insights into actionable improvements.
Qualifications
Minimum
Bachelor’s degree or equivalent practical experience.
8 years of experience in software development.
5 years of experience with design and architecture, and with testing and launching software products.
Preferred
Master’s degree or PhD in Engineering, Computer Science, or a related technical field.
8 years of experience with data structures and algorithms.
5 years of experience in a technical leadership role leading project teams and setting technical direction.
3 years of experience working in a complex, matrixed organization involving cross-functional or cross-business projects.