Research Engineer - LLM Infra Training - Seed Infra

ByteDance
San Jose, 2026-04-20, R&D

About the job

The Seed Infrastructures team oversees distributed training, reinforcement learning frameworks, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

- Conduct research and development on the infrastructure and efficiency of large-scale LLM training

- Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters

- Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads

- Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements

- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods

- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world AI infrastructure solutions

Qualifications

Minimum

- Experience with large-scale distributed training for LLMs

- Strong programming skills in Python and/or C++

- Strong background in ML systems / training infrastructure development

- Proficiency in parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)

- Solid understanding of training stack internals (PyTorch, CUDA, NCCL)

- Experience in performance optimization (memory, communication, throughput)

Preferred

- Hands-on experience with distributed training frameworks and large-scale LLM infrastructure

- Experience leading or mentoring engineering teams or cross-functional projects

- Publications in top-tier AI, systems, or HPC conferences (ICML, OSDI, SOSP, NSDI, SIGCOMM, MLSys) or strong open-source contributions

- Familiarity with benchmarking AI accelerators or large-scale LLM evaluation (e.g., ByteMLPerf)