About the job
This is a rare chance to work at the intersection of frontier vision-language models (VLMs) and real-world deployment. You'll own applied VLM post-training end-to-end for some of the world's largest enterprises while contributing directly to Liquid's core multimodal model development. Most roles force a trade-off between customer impact and foundational work. This one gives you both: deep ownership over how vision-language models are adapted, evaluated, and shipped, and a direct line into the evolution of Liquid's multimodal post-training stack.
Responsibilities
Act as the technical owner for enterprise customer VLM post-training engagements.
Translate customer requirements into concrete multimodal post-training specifications and workflows.
Design and execute visual data generation, filtering, and quality assessment processes, including image-text pair curation, annotation pipelines, and synthetic data generation for visual tasks.
Run supervised fine-tuning, preference alignment, and reinforcement learning workflows for vision-language models.
Design task-specific evaluations for visual understanding, grounding, OCR, document parsing, and other multimodal capabilities. Interpret results and feed learnings back into core post-training pipelines.
Qualifications
Minimum
Hands-on experience with data generation and evaluation for VLM or multimodal post-training.
Experience training or fine-tuning vision-language models using supervised fine-tuning (SFT), preference alignment, and/or reinforcement learning (RL).
Strong intuition for visual data quality, annotation design, and multimodal evaluation.
Familiarity with vision encoders, image-text architectures, and how visual representations interact with language model backbones.
Preferred
Experience with visual grounding, document understanding, OCR, or video understanding tasks.
Experience contributing to shared or general-purpose multimodal post-training infrastructure.
Prior exposure to customer-facing or applied ML delivery environments.
Familiarity with multimodal alignment or RL techniques that go beyond basic supervised fine-tuning.