About the job
Google Home is transitioning from a reactive tool, driven by voice commands and simple logic, to a proactive, autonomous agent powered by Gemini. The Home Quality and Intelligence Data Science team is seeking a Senior Data Scientist to support this transition. In this role, you will collaborate with a team of executive practitioners to build the evaluation frameworks that quantify "intelligence" and "reliability" in non-deterministic environments. You will interpret how product and engineering choices impact the home experience, providing the data signal needed to drive decisions under technical uncertainty. You will also contribute to advancing agentic data science within the team, using AI to automate our internal workflows and scale our collective impact.
Responsibilities
Interpret our collection of automated and human quality metrics as indicators of overall product health, identifying high-impact headroom opportunities — for example, combining autorater scores and user telemetry to pinpoint where the Gemini agent needs improvement.
Advocate for a culture of metric-informed decision-making, experimentation, and high-quality data modeling.
Act as a go-to expert within the team on specific data science methodologies related to AI evaluation.
Build and prototype analyses and business cases iteratively to provide insights at scale.
Develop comprehensive knowledge of Google data structures and metrics, advocating for changes where needed for product development.
Stay abreast of the latest advancements in AI evaluation, data science, and agentic AI, and apply them to improve our team's practices.
Qualifications
Minimum
Bachelor's degree in Statistics, Mathematics, Data Science, Engineering, Physics, Economics, or a related quantitative field.
8 years of experience using analytics to solve product or business problems, performing statistical analysis, and coding (e.g., Python, R, SQL), or 5 years of experience with a Master's degree.
Experience in causal inference analysis.
Preferred
Master's degree in Statistics, Mathematics, Data Science, Engineering, Physics, Economics, or a related quantitative field.
Experience in large language model (LLM) quality assessment and evaluation, including human evals and LLM-as-judge autoraters.