Research Scientist Intern (TikTok-Privacy Innovation Lab-Multimodal Generative Model) - 2026 Start (PhD)

TikTok
San Jose, California

About the job

We are building next-generation generative foundation models, with a strong focus on diffusion-based and unified generation-understanding architectures, deployed in privacy-sensitive, production environments. This role sits at the intersection of:

- Large-scale model training systems
- GPU-first architecture and kernel-level optimization
- Diffusion / DiT / unified multimodal foundation models
- Privacy-preserving and compliant training pipelines

You will work on end-to-end training architecture design, from model-parallel execution and GPU efficiency to robust, fault-tolerant, privacy-aware training infrastructure.

Responsibilities

Develop a deep understanding of DiT + Flow Matching / Rectified Flow-based generative models, and optimize them

Lead or contribute to the design and implementation of: Diffusion Transformer (DiT / MM-DiT) architecture improvements; Unified text-to-image / text-to-video model designs; Latent space, tokenization, and conditioning mechanisms.

Perform joint algorithmic and system-level optimization, targeting: Training stability and convergence speed; Memory and compute efficiency; Generation quality and consistency

Address challenges in long-sequence, high-resolution, and video generation, including: Efficient attention and temporal modeling strategies; Long-context and long-latent modeling

Collaborate closely with systems and kernel engineers to map model designs to efficient implementations

Reproduce, analyze, and advance state-of-the-art generative models (beyond simple replication)

Qualifications

Minimum

Currently pursuing a PhD in computer science, computer engineering, or a related technical discipline.

Deep understanding of Diffusion / Flow Matching / Rectified Flow

Strong familiarity with DiT / Transformer-based architectures in generative modeling

Ability to debug the full pipeline from mathematical formulation → code → training → generated outputs

Proficiency with PyTorch and hands-on experience training large-scale models

Preferred

Practical experience with text-to-image or text-to-video models (non-toy systems)

Familiarity with multimodal modeling (Text / Image / Video / Audio)

Research publications or open-source contributions