About the job
Agent Dev Velocity builds the tooling and evaluation backbone that helps Notion ship high-quality AI faster and more safely. We build the infrastructure that makes AI evaluations easy to create, cheap to run, and hard to ignore, so engineers across the AI org can iterate with confidence. In this role, you will work at the intersection of developer tooling, distributed systems, and measurement. You will build systems for running and maintaining evals at scale, and you will help create durable benchmarks and datasets that keep us honest about quality over time. You will also help evolve evals into a cohesive system by enabling reusable eval workspaces and data-driven workflows that surface issues through data mining and continuous measurement.
Responsibilities
- Build and improve scalable eval runners and harnesses that work locally, in CI, and on scheduled runs.
- Make it easy for engineers to add high-signal evals: better templates, fixtures, debugging tools, and clear workflows.
- Build and maintain benchmark and dataset tooling (curation pipelines, versioning, artifact management, and regression tracking).
- Improve reliability and observability for eval execution (retries, idempotency, cost and latency visibility, and failure triage).
- Partner closely with AI product, AI platform, and infrastructure teams to integrate evals into day-to-day shipping workflows.
Qualifications
Minimum
- Strong software engineering fundamentals and experience shipping production systems.
- Proficiency with TypeScript/Node and/or Python.
- Experience building reliable systems in distributed environments (queues, retries, idempotency, and backfills).
- Comfort working with data pipelines (batch processing, data quality, versioning, and reproducibility).
- Practical experience designing measurement or evaluation systems (LLM eval experience is a plus, but strong testing and benchmarking instincts are also relevant).
Preferred
- Experience building developer tooling (CLI tools, CI integrations, or internal platforms).
- Familiarity with LLM evaluation techniques (rubrics, human review loops, dataset curation, and regression detection).
- Experience collaborating across teams to roll out new workflows and drive adoption.