About the job
The VLM team builds vision-language models that run on-device, under tight latency and memory constraints, without sacrificing quality. We have released four best-in-class models and we're just getting started. This team owns the full VLM pipeline end-to-end: from researching new architectures and training algorithms through data curation, evaluation, and deployment. You'll join a focused, hands-on group that works directly on models and collaborates closely with our pretraining, post-training, and infrastructure teams. Success here is measured by the capability of the models we ship.
Responsibilities
- Lead a new model capability end-to-end, from task spec through data curation, training recipe, ablations, and evaluation to the final shipped model.
- Improve visual reasoning through reinforcement learning and preference optimization methods.
- Push the quality-efficiency frontier through encoder/connector design that improves token efficiency. Example outcome: a connector that cuts vision tokens without quality loss.
Qualifications
Minimum
- Hands-on experience training or evaluating VLMs, with demonstrated experimental rigor.
- Ability to turn research ideas into scalable implementations and to refine and iterate on hypotheses.
- Proficiency in Python and at least one deep learning framework.
- M.S. or Ph.D. in Computer Science, Mathematics, or a related field, or equivalent industry experience.
Preferred
- Experience building or optimizing multimodal training or data pipelines.
- Experience with distributed training (DeepSpeed, FSDP, Megatron-LM, etc.).
- Multimodal post-training experience (SFT, preference optimization, RL-style methods).
- Dataset design and data quality expertise (quality and diversity assessment, long-tail mining).
- Prior open-source contributions (code, data, models) on GitHub or Hugging Face.
- Published research at top AI conferences (NeurIPS, ICML, CVPR, ECCV, ICLR, ACL, etc.).
- Experience with computer vision or visual representation learning.