Tech Lead, Research Scientist/Engineer - AI Infrastructure

ByteDance
San Jose · 2025-05-28 · Algorithm

About the job

We are seeking a Tech Lead to provide technical stewardship in defining and building the next generation of AI infrastructure. You will help build the technical roadmap at the intersection of AI models, software systems, and emerging hardware, architecting the infrastructure that keeps AI at ByteDance reliable, efficient, and scalable.

Responsibilities

AI Infrastructure Architecture

Design and evaluate scalable infrastructure architectures for large-scale ML workloads across compute, storage, and networking. Develop technical proposals and specifications that guide next-generation AI infrastructure systems.

Research & Technology Exploration

Track emerging trends in AI systems, distributed computing, and hardware acceleration. Conduct technical investigations, build prototypes, and share insights through technical reports and presentations.

Performance & System Optimization

Analyze and optimize performance across the ML infrastructure stack—including scheduling, networking, storage, and training frameworks—through benchmarking, experimentation, and bottleneck analysis.

Cross-Team Technical Alignment

Work across research and engineering teams to translate AI workload requirements into scalable infrastructure solutions, providing architectural guidance and driving cross-team technical initiatives.

Qualifications

Minimum

- Master's degree or PhD in Computer Science, Electrical Engineering, or a related technical field.

- Strong proficiency in integrating AI tools into knowledge discovery and research workflows.

- At least 5 years of experience in distributed systems, infrastructure engineering, or ML systems, including experience evaluating trade-offs across hardware, software, and algorithms.

- Excellent communication skills and the ability to collaborate across teams.

Preferred

- Experience with large-scale model training and inference, including distributed training, KV cache–aware serving, GPU/accelerator optimization, and high-performance networking (e.g., RDMA, NCCL).

- Experience with heterogeneous AI compute systems, large-scale training clusters, HPC-style distributed workloads, and data pipelines for large model training and evaluation.

- Publications in systems and/or machine learning conferences (e.g., NeurIPS, OSDI, SOSP, ASPLOS, MLSys).

- Contributions to open-source projects.