Research Scientist Intern (TikTok-Privacy Innovation Lab-Multimodal Generative Model)

About the job

We are building next-generation generative foundation models, with a strong focus on diffusion-based and unified generation-understanding architectures, deployed in privacy-sensitive production environments. This role sits at the intersection of large-scale model training systems; GPU-first architecture and kernel-level optimization; Diffusion / DiT / unified multimodal foundation models; and privacy-preserving, compliant training pipelines. You will work on end-to-end training architecture design, from model-parallel execution and GPU efficiency to robust, fault-tolerant, privacy-aware training infrastructure.

Responsibilities

1. Develop a deep understanding of DiT + Flow Matching / Rectified Flow–based generative models, and optimize them

2. Lead or contribute to the design and implementation of: Diffusion Transformer (DiT / MM-DiT) architecture improvements; Unified text-to-image / text-to-video model designs; Latent space, tokenization, and conditioning mechanisms

3. Perform joint algorithmic and system-level optimization, targeting: Training stability and convergence speed; Memory and compute efficiency; Generation quality and consistency

4. Address challenges in long-sequence, high-resolution, and video generation, including: Efficient attention and temporal modeling strategies; Long-context and long-latent modeling

5. Collaborate closely with systems and kernel engineers to map model designs to efficient implementations

6. Reproduce, analyze, and advance state-of-the-art generative models (beyond simple replication)

Qualifications

Minimum

1. Currently pursuing a PhD in computer science, computer engineering, or a related technical discipline

2. Deep understanding of Diffusion / Flow Matching / Rectified Flow

3. Strong familiarity with DiT / Transformer-based architectures in generative modeling

4. Ability to debug the full pipeline from mathematical formulation → code → training → generated outputs

5. Proficiency with PyTorch and hands-on experience training large-scale models

Preferred

1. Practical experience with text-to-image or text-to-video models (non-toy systems)

2. Familiarity with multimodal modeling (Text / Image / Video / Audio)