Research Engineer - LLM Training Infrastructure

About the job

The Seed Infrastructures team oversees distributed training, reinforcement learning frameworks, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

Conduct research and development on large-scale LLM training infrastructure and efficiency

Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters

Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads

Research and optimize networking, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements

Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods

Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world AI infrastructure solutions

Qualifications

Minimum

Experience with large-scale distributed training for LLMs

Strong programming skills in Python and/or C++

Strong background in ML systems / training infrastructure development

Proficiency in parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)

Solid understanding of training stack internals (PyTorch, CUDA, NCCL)

Experience in performance optimization (memory, communication, throughput)

Preferred

Hands-on experience with distributed training frameworks and large-scale LLM infrastructure

Experience leading or mentoring engineering teams or cross-functional projects

Publications in top-tier AI, systems, or HPC conferences (ICML, OSDI, SOSP, NSDI, SIGCOMM, MLSys) or strong open-source contributions