About the job
We are building next-generation generative foundation models, with a strong focus on diffusion-based and unified generation-understanding architectures, deployed in privacy-sensitive production environments. This role sits at the intersection of large-scale model training systems, GPU-first architecture and kernel-level optimization, diffusion / DiT / unified multimodal foundation models, and privacy-preserving, compliant training pipelines. You will work on end-to-end training architecture design, from model-parallel execution and GPU efficiency to robust, fault-tolerant, privacy-aware training infrastructure.
Responsibilities
Design and implement high-performance GPU kernels for core components such as Transformer / Attention / MoE / Diffusion layers
Perform end-to-end optimization for large model training workloads
Conduct in-depth analysis of GPU execution bottlenecks, including compute, memory, and scheduling
Use and extend Triton / CUDA / CUTLASS, and integrate optimized kernels with PyTorch / XLA / custom runtimes
Collaborate closely with model research teams to translate new model architectures into efficient, production-ready implementations
Reproduce, benchmark, and improve state-of-the-art system optimization techniques, validating gains in real training and inference settings
Qualifications
Minimum
Currently pursuing a PhD in computer science, computer engineering, or a related technical discipline
Solid understanding of GPU architecture and execution models
Proficiency in CUDA C++ or Triton, with the ability to independently write and optimize kernels
Strong familiarity with Transformer / Attention computation patterns and performance bottlenecks
Ability to read, reproduce, and reason about systems papers or open-source implementations
Preferred
Hands-on experience with large-scale model training
Familiarity with PyTorch internals (e.g., Autograd, dispatcher, ATen)
Experience with kernel profiling and performance tuning (e.g., Nsight, nvprof, nsys)
Publications, open-source contributions, or performance benchmark results