Research Scientist - Seed Multimodal Interaction and World Model

About the job

Established in 2023, the ByteDance Seed team is dedicated to pioneering new paths toward artificial general intelligence. We aspire to advance the frontier of intelligence to drive progress for both technology and society. With a long-term vision for the AI sector, the Seed team's research spans MLLM, GenMedia, AI for Science, and Robotics. We maintain a global presence with laboratories and career opportunities across China, Singapore, and the United States. To date, we have launched industry-leading general foundation models and cutting-edge multimodal capabilities. Our technology powers over 50 application scenarios, including Doubao, Jimeng, TRAE, Dola, and Dreamina, and serves enterprise customers through Volcano Engine and BytePlus. Third-party data shows that the Doubao App ranks first in user volume in the Chinese market, while Doubao foundation models lead the industry in average daily token consumption.

The Seed Multimodal Interaction and World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities. The team also aspires to advance the exploration and development of multimodal assistant products.

Responsibilities

- Research and develop large-scale multimodal foundation models

- Develop unified modeling frameworks that integrate video, audio, and language, with a focus on visual latent reasoning

- Explore reinforcement learning-based approaches to bridge understanding and generation for multimodal visual reasoning

- Collaborate with researchers to evaluate models on tasks involving world modeling, reasoning, and instruction-conditioned generation

Qualifications

Minimum

- Master's or PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline

- Publications in accredited venues, such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, or other leading conferences

- Strong research background in at least one of the following: reinforcement learning, multimodal learning, video understanding, or vision-language modeling

Preferred

- Experience with reinforcement learning in multimodal or interactive environments

- Familiarity with video generation or diffusion-based generative models

- Experience with large-scale model training

- Solid programming and engineering skills, with experience building training or evaluation pipelines for ML models