Senior Deep Learning Scientist, Multimodal Conversational AI

Nvidia
US, CA, Santa Clara | 2026-03-04 | onsite

About the job

NVIDIA is hiring Senior Deep Learning Scientists interested in streaming multimodal conversational AI, including speech, audio, vision, voice chat, and action, as well as human-AI interaction. You will demonstrate foundational expertise in deep learning, reinforcement learning, computational statistics, and applied mathematics. You will have a chance to define core algorithmic improvements and scale your ideas through our Nemotron platform. You will work on high-impact, high-visibility large language model products that improve the experience for millions of users. If you are creative and passionate about real-world conversational AI issues, come join our Nemotron LLM team.

Responsibilities

Develop, train, fine-tune, and deploy streaming large language models to power multimodal conversational AI systems encompassing multimodal understanding, speech synthesis, speech-to-speech conversation, video generation, UI and animation rendering and control, environment interaction, and dialog reasoning and tool systems

Apply new fundamental and applied research to develop products for multimodal conversational artificial intelligence

Apply techniques such as instruction tuning, reinforcement learning from human feedback (RLHF), reinforcement learning with verifiable reward (RLVR), and parameter-efficient fine-tuning methods like p-tuning, adapters, and LoRA to improve embodied conversational LLMs for multiple use cases

Lead the collection, development, and labeling of domain-specific datasets to train LLMs for various multimodal tasks and applications

Measure and benchmark model and application performance. Analyze model accuracy and bias, and recommend next steps and improvements. Collaborate with various teams on new product features and improvements to existing products

Participate in developing and reviewing code, building documents, and conducting use case reviews and test plan reviews. Help innovate, identify problems, recommend solutions, and perform triage in a collaborative team environment

Qualifications

Minimum

Master’s degree (or equivalent experience) or PhD in Computer Science, Electrical Engineering, Artificial Intelligence, or Applied Math with 8+ years of experience

Excellent programming skills in Python with strong fundamentals in programming, optimizations, and software development

Strong knowledge of ML/DL techniques, algorithms, and tools with exposure to CNN, RNN (LSTM), Transformers (ViT, BERT, BART, GPT/T5, Megatron, LLMs, MoEs)

Experience training real-time audio language models, streaming visual language models, and streaming real-time audio-visual language models, as well as ViT, BERT, GPT, and Nemotron models, for computer vision, NLP, and dialog system tasks using the PyTorch deep learning framework, including data wrangling, tokenization, and multimodal alignment

Practical experience in natural language processing, speech/audio processing, computer vision, machine learning, and human-AI interaction

Hands-on experience with conversational AI technologies such as natural language understanding, natural language generation, dialog systems (including system integration, state tracking, and action prediction), information retrieval, question answering, and machine translation

Understanding of model development life cycle and experience with model development workflows & traceability, and versioning of datasets, including know-how of database management and queries (in SQL, MongoDB, etc.)

Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment

Preferred

Native or near-native fluency in one of these languages: Spanish, Mandarin, German, Japanese, Russian, French, UK English, Arabic, Korean, Italian, or Portuguese

Verified background in building LLMs that incorporate knowledge discovery along with reasoning abilities, including disambiguation, clarification, anticipation, and effective error handling for embodied AI systems

Validated experience adapting LLMs to different domains such as gaming, virtual assistants, video conferencing, and so on

Experience integrating embodied AI systems with various sensor inputs (camera, microphone, torch, and so on) and backend action fulfillment systems

Experience with long-term reasoning for embodied AI tasks (navigation, mobile manipulation, instruction following, and collaboration with humans) in gaming/physical environments, given natural-language instructions.