About the job
We are looking for a Senior Researcher - GPU Performance – Hardware/Software Codesign to explore hardware- and kernel-level optimizations that deliver significant efficiency gains for Large Language Models and Generative AI experiences. The qualified candidate will have a solid background in GPU architecture, accelerator design, machine learning, or systems research, and the ambition to apply it to large-scale production systems. This role combines deep technical knowledge of GPU architecture with practical implementation skills to create efficient, scalable computational kernels. The qualified candidate is also expected to have a history of solving hard technical problems and the motivation to tackle the hardest problems in building a full end-to-end AI stack. An entrepreneurial approach and the ability to take initiative and move fast are essential.
Responsibilities
Design, implement, and optimize GPU kernels for complex computational workloads such as AI inference.
Research and develop novel optimization techniques for GPU kernel generation.
Profile and analyze kernel performance using advanced diagnostic tools.
Build automated solutions for kernel optimization and tuning.
Collaborate with other researchers to improve model performance.
Document optimization strategies and maintain performance benchmarks.
Contribute to the development of internal GPU computing frameworks.
Qualifications
Minimum
Doctorate in a relevant field OR equivalent experience.
2+ years of experience in GPU architecture, memory hierarchies, parallel computing, and algorithm optimization.
2+ years of experience in GPU programming, including performance profiling and optimization tools.
Strong C++ programming skills.
Preferred
5+ years of experience in GPU programming and optimization, with expert knowledge of CUDA, ROCm, Triton, PTX, CUTLASS, or similar GPU programming frameworks.
Experience with machine learning frameworks (PyTorch, TensorFlow).
Familiarity with compiler optimization techniques and a background in auto-tuning and automated code generation.
Publication record in relevant conferences or journals (MLSys, NeurIPS, ICML, ICLR, AISTATS, ACL, EMNLP, NAACL, ISCA, MICRO, ASPLOS, HPCA, SOSP, OSDI, NSDI, etc.).