Member of Technical Staff - ML Research Engineer, Data

About the job

Our Data team powers Liquid Foundation Models across pre-training, vision, audio, and emerging modalities. Public data sources are plateauing. Model performance increasingly depends on purpose-built datasets. We need ML-minded engineers who can collect, filter, and synthesize high-quality data at scale. We treat data as a research problem, not an infrastructure problem. Our engineers run experiments, design ablations, and measure how data decisions move model quality. We will match you to the team where you can grow the fastest and have the most impact: pre-training, post-training RL, vision-language, audio, or multimodal.

Responsibilities

Build and maintain data processing, filtering, and selection pipelines at scale

Create pipelines for pre-training, mid-training, SFT, and preference optimization datasets

Design synthetic data generation systems using LLMs, structured prompting, and domain-specific generators

Design and run evaluations and ablations to measure a dataset's impact on model performance

Monitor public datasets across text, vision, and audio domains

Collaborate with pre-training, vision, and audio teams on modality-specific data needs

Qualifications

Minimum

Strong Python skills with the ability to quickly comprehend problems and translate them into clean, working code

Solid ML fundamentals: experience training, evaluating, and iterating on models (PyTorch preferred)

Track record of learning new technical domains quickly

5+ years of relevant experience with a B.S., 3+ years with an M.S., or 1+ year with a Ph.D.

Preferred

Experience with synthetic data generation, data curation, or ML evaluation (designing evals, benchmarking, measuring data and model quality)

Experience with LLMs, VLMs, computer vision, or audio data pipelines

Open-source contributions or publications at NeurIPS, ICML, ICLR, or CVPR