Research Engineer – Multimodal Training Infrastructure (Seed Infra)

About the job

The Seed Infrastructures team oversees distributed training, reinforcement learning frameworks, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

- Conduct research and development on large-scale infrastructure to enable efficient training of foundation models, multimodal LLMs, and image/video generation models

- Design and optimize distributed training strategies for multimodal LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters

- Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads

- Research and optimize networking, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements

- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods

- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world infrastructure solutions

Qualifications

Minimum

- Deep expertise in large-scale distributed training of LLMs and multimodal models

- Strong systems research background with demonstrated ability to design, build, and optimize large-scale ML systems

- Proven experience with parallelism strategies (e.g., data, model, pipeline, expert parallelism) and performance optimization on large GPU clusters

- Strong programming skills and hands-on experience implementing production-grade ML systems or infrastructure

- Solid understanding of algorithm–system co-design and cross-layer optimization for training efficiency, scalability, and reliability
