NVIDIA Nemotron Nano V2 VL

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low inference throughput and poor modeling efficiency in long-document and long-video understanding, as well as in multi-step reasoning under real-world conditions, this paper proposes a hybrid Mamba-Transformer vision-language architecture that combines the linear-complexity sequence modeling of state space models with the representational power of Transformers. A lightweight token compression mechanism significantly reduces sequence length while preserving critical semantic content, and the model is trained with a large-scale, customized multimodal recipe and released in multiple precisions (BF16/FP8/FP4). Experiments show state-of-the-art performance on long-sequence tasks, including document understanding and video temporal reasoning, with a 2.3× inference-throughput improvement over existing SOTA models. To foster reproducibility and further research, the authors publicly release multi-precision model weights, partial training code, and curated datasets.
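The summary names a token compression mechanism but does not describe it. One common reduction technique in recent vision-language models is pixel-shuffle merging, where each 2×2 block of vision tokens is folded into a single higher-dimensional token before entering the LLM. A minimal NumPy sketch, purely illustrative: the function name, the 2×2 ratio, and the choice of pixel shuffle itself are assumptions, not the paper's confirmed recipe.

```python
import numpy as np

def pixel_shuffle_compress(tokens: np.ndarray, ratio: int = 2) -> np.ndarray:
    """Fold each ratio x ratio patch of vision tokens into one token by
    concatenating channels: (H, W, C) -> (H/ratio, W/ratio, C * ratio^2)."""
    h, w, c = tokens.shape
    assert h % ratio == 0 and w % ratio == 0
    t = tokens.reshape(h // ratio, ratio, w // ratio, ratio, c)
    t = t.transpose(0, 2, 1, 3, 4)  # group each ratio x ratio block together
    return t.reshape(h // ratio, w // ratio, c * ratio * ratio)

grid = np.random.default_rng(0).standard_normal((32, 32, 1024))  # 1024 vision tokens
out = pixel_shuffle_compress(grid, ratio=2)                      # -> 256 tokens
```

The sequence length seen by the language model drops 4×, while all channel information is retained (just regrouped), which is why this style of compression is popular for long-document inputs.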

📝 Abstract
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron vision-language series designed for strong real-world document understanding, long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and innovative token reduction techniques to achieve higher inference throughput in long document and video scenarios. We are releasing model checkpoints in BF16, FP8, and FP4 formats and sharing large parts of our datasets, recipes and training code.
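The abstract mentions checkpoint releases in BF16, FP8, and FP4. As a rough illustration of the precision-vs-error trade-off behind such releases, here is a per-tensor absmax fake-quantization sketch. Real FP8/FP4 formats use floating-point grids with finer-grained scales, so this uniform integer grid is only a stand-in, not the release's actual quantization scheme.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Per-tensor absmax fake quantization: map weights onto a symmetric
    2^bits-level grid and back, simulating the rounding error of a
    low-precision checkpoint."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 levels per sign at 8 bits
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
err8 = np.abs(fake_quantize(w, 8) - w).max()   # 8-bit rounding error
err4 = np.abs(fake_quantize(w, 4) - w).max()   # 4-bit rounding error (larger)
```

Halving the bit width roughly doubles memory bandwidth and compute density on supporting hardware, at the cost of the larger rounding error visible here, which is why multiple precisions are shipped rather than one.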
Problem

Research questions and friction points this paper is trying to address.

Enhancing real-world document understanding capabilities
Improving long video comprehension and reasoning tasks
Increasing inference throughput for multimodal scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-Transformer LLM architecture
Innovative token reduction techniques
Enhanced datasets and training recipes
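The throughput claim behind the hybrid Mamba-Transformer bullet comes down to complexity: a state-space (Mamba-style) layer carries a constant-size recurrent state through the sequence, so its cost grows linearly in sequence length, versus quadratically for self-attention. A minimal, untrained linear SSM recurrence (all dimensions and matrices below are illustrative, not the model's actual parameterization):

```python
import numpy as np

def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    One pass over the sequence with a fixed-size state: O(L) time in
    sequence length L, versus O(L^2) for full self-attention."""
    L, d = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty((L, d))
    for t in range(L):
        h = A @ h + B @ x[t]   # constant-size recurrent state update
        y[t] = C @ h           # per-step readout
    return y

rng = np.random.default_rng(0)
L, d, n = 1024, 8, 16                      # long sequence, small hidden state
x = rng.standard_normal((L, d))
A = 0.9 * np.eye(n)                        # stable (decaying) state transition
B = rng.standard_normal((n, d)) * 0.1
C = rng.standard_normal((d, n)) * 0.1
y = ssm_scan(x, A, B, C)
```

Note the state `h` never grows with `L`; that is the property that keeps long-video and long-document inference cheap in hybrid architectures like this one.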
Authors

N. Deshmukh (NVIDIA)
Kateryna Chumachenko (Research Scientist, NVIDIA)
Tuomas Rintamaki (NVIDIA)
Matthieu Le (NVIDIA)
Tyler Poon (NVIDIA)
Danial Mohseni Taheri (NVIDIA)
Ilia Karmanov (NVIDIA): Computer Vision
Guilin Liu (Research Scientist, NVIDIA): Computer Vision, Deep Learning, Generative Models
Jarno Seppanen (NVIDIA)
Guo Chen (NVIDIA)
Karan Sapra (Clemson University, NVIDIA): Deep Learning, High Performance Computing, Image Processing, Genomics, Coexpression Networks
Zhi-Wei Yu (NVIDIA)
Adi Renduchintala (NVIDIA)
Charles Wang (Professor/Director, Center for Genomics, Loma Linda University)
Peter Jin (UC Berkeley): Machine Learning, Artificial Intelligence
Arushi Goel (Research Scientist, NVIDIA): Computer Vision, Machine Learning, Vision and Language
Mike Ranzinger (Research Scientist, NVIDIA)
Lukas Voegtle (NVIDIA)
Philipp Fischer (University of Freiburg): Computer Vision
Timo Roman (NVIDIA)
Wei Ping (Distinguished Research Scientist, NVIDIA): machine learning, large language models, speech synthesis, reinforcement learning
Bo Wang (NVIDIA)
Zhuolin Yang (NVIDIA)
Nayeon Lee (School of Computing, KAIST): AI Ethics, Cross-cultural NLP, computational social science, NLP
Shaokun Zhang (The Pennsylvania State University): Language Agent, Reinforcement Learning
Fuxiao Liu (Research Scientist, NVIDIA): Multi-Modal Learning, MLLM, Hallucination
Zhiqi Li (PhD, Nanjing University): computer vision
Di Zhang (NVIDIA)
Gregorio Heinrich (NVIDIA)
Hongxu Yin (NVIDIA)
Song Han (NVIDIA)
Pavlo Molchanov (NVIDIA Research): AI, Machine Learning, Efficient Deep Learning, Semi-supervised learning, network inversion
Parth Mannan (NVIDIA)
Yaohui Xu (NVIDIA)
Jane Polak Scowcroft (NVIDIA)
Tom Balough (NVIDIA)
Subhashree Radhakrishnan (NVIDIA)
Paris Zhang (NVIDIA)
Sean Cha (NVIDIA)
Ratnesh Kumar (NVIDIA)
Zaid Pervaiz Bhat (NVIDIA)
Jian Zhang (NVIDIA)
Darragh Hanley (NVIDIA)
Pritam Biswas (NVIDIA)
J. Oliver (NVIDIA)
Kevin Vasques (NVIDIA)
Roger Waleffe (University of Wisconsin-Madison): Machine Learning, Graph Learning, Systems, Plasma Physics
Duncan Riach (NVIDIA)
Oluwatobi Olabiyi (NVIDIA)
Ameya Sunil Mahabaleshwarkar (Deep Learning Scientist, NVIDIA): Deep Learning, Natural Language Processing, Large Language Models, Small Language Models
Bilal Kartal (NVIDIA): AI, Deep Learning, Reinforcement Learning, Multi-Agent Systems
Pritam Gundecha (NVIDIA)
Khanh Nguyen (NVIDIA)
Alexandre Milesi (NVIDIA)
Eugene Khvedchenia (NVIDIA)
Ran Zilberstein (NVIDIA)
Ofri Masad (NVIDIA)
Natan Bagrov (NVIDIA)
Nave Assaf (NVIDIA)
Tomer Asida (NVIDIA)
Daniel Afrimi (NVIDIA)
Amit Zuker (NVIDIA)
Netanel Haber (NVIDIA)
Zhiyu Cheng (NVIDIA)
Jingyu Xin (NVIDIA)
Di Wu (NVIDIA)
Nik Spirin (NVIDIA)
Maryam Moosaei (NVIDIA)
Roman Ageev (NVIDIA)
Vanshil Atul Shah (NVIDIA)
Yuting Wu (NVIDIA)
Daniel Korzekwa (NVIDIA): Pruning, Distillation, LLM, VLM, Speech
Unnikrishnan Kizhakkemadam Sreekumar (NVIDIA)
Wanli Jiang (NVIDIA)
Padmavathy Subramanian (NVIDIA)
Alejandra Rico (NVIDIA)
Sandip Bhaskar (NVIDIA)
Saeid Motiian (NVIDIA)
Kedi Wu (NVIDIA)
Annie Surla (NVIDIA)
Chia-Chih Chen (NVIDIA)
Hayden Wolff (NVIDIA)
Matthew I. Feinberg (NVIDIA)
Melissa Corpuz (NVIDIA)
Marek Wawrzos (NVIDIA)
E. Long (NVIDIA)
Aastha Jhunjhunwala (NVIDIA)
Paul Hendricks (NVIDIA)
Farzan Memarian (NVIDIA)
Benika Hall (NVIDIA)
Xin-Yu Wang (NVIDIA)
David Mosallanezhad (NVIDIA)
Soumye Singhal (NVIDIA): Deep Learning, NLP, Artificial Intelligence
L. Vega (NVIDIA)
Katherine Cheung (NVIDIA)
Krzysztof Pawelec (NVIDIA)
Michael Evans (NVIDIA)
K. Luna (NVIDIA)
Jie Lou (Xiaohongshu): Alignment, RLHF
Erick Galinkin (NVIDIA)
Akshay Hazare (NVIDIA)
Kaustubh Purandare (NVIDIA)
Ann Guan (NVIDIA)
Anna Warno (NVIDIA)
Chen Cui (NVIDIA)
Yoshi Suhara (NVIDIA): Natural Language Processing, Machine Learning, Computational Social Science
Shibani Likhite (NVIDIA)
Seph Mard (NVIDIA)
M. Price (NVIDIA)
Laya Sleiman (NVIDIA)
Saori Kaji (NVIDIA)
Udi Karpas (NVIDIA)
Kari Briski (NVIDIA)
Joey Conway (NVIDIA)
Michael Lightstone (NVIDIA)
Jan Kautz (Vice President of Research, NVIDIA Research): Computer Vision, Machine Learning, Visual Computing
Mohammad Shoeybi (Senior Director of Applied Research, NVIDIA): Large Language Models, NLP, Multi-Modal Models, Generative AI
Mostofa Patwary (Director, Applied Deep Learning Research, NVIDIA): Natural Language Processing, Large Scale Deep Learning, High Performance Computing, Parallel
Jon Cohen (NVIDIA)
Oleksii Kuchaiev (NVIDIA): machine learning, deep learning, graph theory, bioinformatics
Andrew Tao (NVIDIA): Computer Vision, Machine Learning
Bryan Catanzaro (NVIDIA): Parallel Computing, Machine Learning