About the job
We are recruiting top research engineers for the Autonomous Vehicles Research team at NVIDIA with strong expertise in software engineering and in artificial intelligence topics such as deep learning, reinforcement learning, and generative modeling. You must have strong programming skills, a solid track record of training deep learning models at scale, and a good mathematical foundation for analyzing new AI algorithms. We focus on AI models for autonomous driving, such as agent behavior models, end-to-end AV architectures, AI safety, closed-loop training approaches, and AV foundation models (VLMs, reasoning models, etc.). We will be publishing at top venues and working with the broader scientific community. The ability to communicate with other teams and with domain scientists across different areas is essential.
Responsibilities
Develop large-scale supervised learning and reinforcement learning training frameworks, capable of running on thousands of GPUs, to support multimodal foundation models for AVs.
Optimize GPU and cluster utilization for efficient model training and fine-tuning on massive datasets.
Implement scalable data loaders and preprocessors tailored for multimodal datasets, including videos, text, and sensor data.
Build and optimize simulation infrastructure (based on GPU-accelerated simulators) to support the training of driving policies for AVs at scale.
Collaborate with researchers to integrate cutting-edge model architectures into scalable training pipelines.
Develop sim-to-real transfer pipelines and work closely with the AV product team to deploy to real-world cars.
Propose scalable solutions that combine LLMs with policy learning.
Apply reinforcement learning to fine-tune multimodal LLMs.
Develop robust monitoring and debugging tools to ensure the reliability and performance of training workflows on large GPU clusters.
Qualifications
Minimum
Bachelor's degree in Computer Science, Robotics, Engineering, or a related field, or equivalent experience.
10+ years of full-time industry experience in large-scale MLOps and AI infrastructure.
Proven experience designing and optimizing distributed training systems with frameworks like PyTorch, JAX, or TensorFlow.
Deep familiarity with reinforcement learning algorithms like PPO, SAC, or Q-learning, including experience tuning hyperparameters and reward functions.
Familiarity with common policy learning techniques such as reward shaping, domain randomization, and curriculum learning.
Deep understanding of GPU acceleration, CUDA programming, and cluster management tools like Kubernetes.
Strong programming skills in Python and a high-performance language such as C++ for efficient system development.
Strong experience with large-scale GPU clusters, HPC environments, and job scheduling/orchestration tools (e.g., SLURM, Kubernetes).
Preferred
No preferred qualifications listed.