About the job
Our Data team powers Liquid Foundation Models across pre-training, vision, audio, and emerging modalities. Public data sources are plateauing, and model performance increasingly depends on purpose-built datasets. We need ML-minded engineers who can collect, filter, and synthesize high-quality data at scale. We treat data as a research problem, not an infrastructure problem: our engineers run experiments, design ablations, and measure how data decisions move model quality. We will match you to the team where you can grow fastest and have the most impact: pre-training, post-training RL, vision-language, audio, or multimodal.
Responsibilities
Build and maintain data processing, filtering, and selection pipelines at scale
Create pipelines for pre-training, mid-training, SFT, and preference optimization datasets
Design synthetic data generation systems using LLMs, structured prompting, and domain-specific generators
Design and run evaluations and ablations to measure each dataset's impact on model performance
Monitor public datasets across text, vision, and audio domains
Collaborate with pre-training, vision, and audio teams on modality-specific data needs
Qualifications
Minimum
Strong Python skills with the ability to quickly comprehend problems and translate them into clean, working code
Solid ML fundamentals: experience training, evaluating, and iterating on models (PyTorch preferred)
Track record of learning new technical domains quickly
5+ years of relevant experience with a B.S., 3+ years with an M.S., or 1+ year with a Ph.D.
Preferred
Experience with synthetic data generation, data curation, or ML evaluation (designing evals, benchmarking, measuring data and model quality)
Experience with LLMs, VLMs, computer vision, or audio data pipelines
Open-source contributions or publications at NeurIPS, ICML, ICLR, or CVPR