Research Scientist Intern (TikTok-Privacy Innovation Lab-GPU Systems & Model Optimization)

About the job

We are building next-generation generative foundation models, with a strong focus on diffusion-based and unified generation-understanding architectures, deployed in privacy-sensitive production environments. This role sits at the intersection of large-scale model training systems, GPU-first architecture and kernel-level optimization, diffusion / DiT / unified multimodal foundation models, and privacy-preserving, compliant training pipelines. You will work on end-to-end training architecture design, from model-parallel execution and GPU efficiency to robust, fault-tolerant, privacy-aware training infrastructure.

Responsibilities

Design and implement high-performance GPU kernels for core model components such as Transformer / Attention / MoE / Diffusion

Perform end-to-end optimization for large model training workloads

Conduct in-depth analysis of GPU execution bottlenecks, including compute, memory, and scheduling

Use and extend Triton / CUDA / CUTLASS, and integrate optimized kernels with PyTorch / XLA / custom runtimes

Collaborate closely with model research teams to translate new model architectures into efficient, production-ready implementations

Reproduce, benchmark, and improve state-of-the-art system optimization techniques, validating gains in real training and inference settings

Qualifications

Minimum

Currently pursuing a PhD in computer science, computer engineering, or a related technical discipline

Solid understanding of GPU architecture and execution models

Proficiency in CUDA C++ or Triton, with the ability to independently write and optimize kernels

Strong familiarity with Transformer / Attention computation patterns and performance bottlenecks

Ability to read, reproduce, and reason about systems papers or open-source implementations

Preferred

Hands-on experience with large-scale model training

Familiarity with PyTorch internals (e.g., Autograd, dispatcher, ATen)

Experience with kernel profiling and performance tuning (e.g., Nsight Systems / nsys, Nsight Compute, nvprof)