About the job
As part of our work on machine-generated dialog, we are developing novel measurements of its quality. These include cutting-edge LLM judges for aspects such as groundedness (lack of hallucinations), Siri Tone and Style (a suite of Design requirements), Safety, and others.
Responsibilities
Build an easy-to-use dashboard for our datasets (requires integration with other teams)
Build dashboards to visualize our status
Help define useful metrics
Build tools to facilitate processing of human review results (inter-annotator agreement, storage, selection of the most useful data points for human expert review)
Build tools and pipelines to streamline quality measurement processes
Qualifications
Minimum
M.S. or Ph.D. in Computer Science, Data Science, or Data Engineering
3+ years in data science and/or data engineering (Iceberg, pandas, Python, Tableau or equivalent, data collection and visualization)
2+ years of Python coding
Good understanding of metrics, crowd science, annotation analysis, and statistics
Ability to work independently and cross-functionally to integrate with partner teams' reporting systems and pipelines
Excellent communication skills and the ability to thrive in a highly collaborative work environment
Good engineering practices for creating sustainable, easy-to-use metric reporting pipelines
Preferred
Experience writing and architecting production-level code
Deep understanding of Machine Learning concepts
Experience in model training and/or evaluation
Familiarity with computational linguistics and language quality is a plus