Research Engineer – Reinforcement Learning (RL) Systems & Infrastructure (Seed Infra)

ByteDance
San Jose · 2026-02-25 · R&D

About the job

The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

- Design and build end-to-end reinforcement learning (RL) systems for large-scale models, covering rollout, training, evaluation, and deployment pipelines.

- Develop scalable and fault-tolerant RL infrastructure that operates efficiently under dynamic workloads and heterogeneous compute environments.

- Optimize distributed training performance across GPU clusters, improving throughput, resource utilization, and system stability.

- Collaborate with cross-team researchers on targeted system–algorithm co-design to translate research ideas into robust, production-grade implementations.

- Build tooling, monitoring, and debugging frameworks to ensure reliability and observability of large-scale RL training systems.

Qualifications

Minimum

- Strong background in distributed systems, large-scale ML systems, or deep learning infrastructure

- Experience building or optimizing large-scale training systems (e.g., RL, LLM, multimodal models)

- Solid engineering skills in Python/C++ and familiarity with modern ML stacks (PyTorch, distributed training frameworks, etc.)

- Experience with GPU optimization, parallelism strategies, and system-level performance tuning

- Understanding of reinforcement learning workflows (rollout, policy update, evaluation loops)

Preferred

- Experience with large-scale agent systems

- Familiarity with system design under heterogeneous or dynamic workloads

- Exposure to RL + LLM training or post-training pipelines