About the job
The Seed Multimodal Interaction and World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities. The team also aims to advance the exploration and development of multimodal assistant products.
Responsibilities
- Research and develop large-scale multimodal foundation models
- Develop unified modeling frameworks that integrate video, audio, and language, with a focus on visual latent reasoning
- Explore reinforcement learning-based approaches to bridge understanding and generation for multimodal visual reasoning
- Collaborate with researchers to evaluate models on tasks involving world modeling, reasoning, and instruction-conditioned generation
Qualifications
Minimum
- Master's or PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline
- Publications in recognized venues such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, or other leading conferences
- Strong research background in at least one of the following: reinforcement learning, multimodal learning, video understanding, or vision-language modeling
Preferred
- Experience with reinforcement learning in multimodal or interactive environments
- Familiarity with video generation or diffusion-based generative models
- Experience with large-scale model training
- Solid programming and engineering skills, with experience building training or evaluation pipelines for ML models