Member of Technical Staff - Multi-Modal, Vision

Liquid AI
San Francisco | 2025-10-23 | Hybrid

About the job

The VLM team builds vision-language models that run on-device, under tight latency and memory constraints, without sacrificing quality. We have released four best-in-class models and we're just getting started. This team owns the full VLM pipeline end-to-end: from researching new architectures and training algorithms through data curation, evaluation, and deployment. You'll join a focused, hands-on group that works directly on models and collaborates closely with our pretraining, post-training, and infrastructure teams. Success here is measured by the capability of the models we ship.

Responsibilities

- Lead a new model capability end-to-end, from task spec through data curation, training recipe, ablations, and evaluation to the final shipped model.

- Improve visual reasoning through reinforcement learning and preference optimization methods.

- Push the quality-efficiency frontier through encoder/connector design. Example outcome: a connector that cuts the number of vision tokens without quality loss.

Qualifications

Minimum

- Hands-on experience in training or evaluating VLMs with demonstrated experimental rigor.

- Ability to turn research ideas into scalable implementations and to refine and iterate on hypotheses.

- Proficiency in Python and at least one deep learning framework.

- M.S. or Ph.D. in Computer Science, Mathematics, or a related field; or equivalent industry experience.

Preferred

- Experience building or optimizing multimodal training or data pipelines.

- Experience with distributed training (DeepSpeed, FSDP, Megatron-LM, etc.).

- Multimodal post-training experience (SFT, preference optimization, RL-style methods).

- Dataset design and data quality expertise (quality and diversity assessment, long-tail mining).

- Prior open-source contributions (code, data, models) on GitHub or Hugging Face.

- Published research at top AI conferences (NeurIPS, ICML, CVPR, ECCV, ICLR, ACL, etc.).

- Experience with computer vision or visual representation learning.