Data Engineer

About the job

At xAI, we are building AI systems that push the frontier of human knowledge and scientific discovery. High-quality data is fundamental to every stage of that mission. Our Data team is responsible for ensuring that the models are trained on the right data, in the right form, at the right quality, across every phase of the training lifecycle. This includes partnering closely with acquisition teams to identify where valuable data can be sourced, determining what data is needed to improve model performance, and building the production pipelines and systems that transform raw inputs into high-quality training data at scale. We work at the intersection of data, infrastructure, and machine learning to ensure our models train effectively and reliably.

Responsibilities

Analyze the performance and impact of data used throughout the model training lifecycle

Investigate anomalous model behavior and rigorously identify the data issues that drive poor downstream performance

Design, build, and improve the data cleaning, transformation, and quality-control steps required to produce high-quality training data

Research, evaluate, and develop frontier methods for improving data quality and effectiveness in AI model development

Apply statistical techniques and empirical analysis to make informed, data-driven decisions about dataset quality and model outcomes

Partner across teams to identify where data needs exist and define the highest-impact opportunities for new data acquisition and improvement

Build and maintain production-grade data pipelines, tooling, and software systems that ingest, process, validate, and deliver data for training

Develop metrics, evaluation frameworks, and monitoring systems to assess how data quality influences model behavior at scale

Fuse data from multiple sources into reliable, usable datasets for research and production model training

Create shared datasets, tooling, and internal data products that enable other teams to analyze, debug, and improve model performance

Qualifications

Minimum

Bachelor’s degree in computer science, data science, physics, mathematics, or another STEM discipline

1+ years of data/software engineering experience (internship experience counts)

Experience implementing or analyzing language models or neural networks

Preferred

Professional experience in analytics, data science, machine learning, or data engineering

Experience building and operating production data pipelines for neural network or large-scale machine learning workloads

Strong experience with Python and the broader ecosystem of libraries and tools used in modern machine learning and data development

Experience working with Parquet or similar columnar storage formats in large-scale data systems

Familiarity with Kubernetes and distributed production environments

Experience developing predictive models and machine learning pipelines, including clustering, forecasting, anomaly detection, or related techniques

Experience working with very large-scale datasets, including terabyte- to petabyte-scale data systems

Strong statistical intuition and the ability to use quantitative analysis to guide technical and product decisions, including familiarity with scaling ladder design studies

Ability to operate effectively in a dynamic environment with evolving priorities, changing requirements, and fast-moving technical challenges

Demonstrated ability to take ownership of ambiguous problems, drive projects independently, and develop new expertise where needed