🤖 AI Summary
To address low inference throughput and limited modeling efficiency in long-document understanding, long-video understanding, and multi-step reasoning under real-world conditions, this paper proposes a hybrid Mamba-Transformer vision-language architecture that combines the linear-complexity sequence modeling of state space models with the strong representational power of Transformers. The authors introduce a lightweight token compression mechanism that substantially reduces sequence length while preserving critical semantic content, and pair it with multi-precision quantization (BF16/FP8/FP4) and a customized training strategy driven by large-scale multimodal data. Experiments show state-of-the-art performance on long-sequence tasks, including document understanding and video temporal reasoning, with 2.3× higher inference throughput than existing SOTA models. To support reproducibility and further research, the authors publicly release multi-precision model weights, part of the training code, and curated datasets.
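The summary names a token compression mechanism but does not describe how it works. As one purely illustrative assumption (not the paper's actual method), a minimal sketch of the general idea is average-pooling groups of adjacent visual tokens to shrink the sequence the LLM must process; the `compress_tokens` helper and the 4× ratio below are hypothetical:

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, ratio: int = 4) -> np.ndarray:
    """Reduce sequence length by average-pooling `ratio` adjacent tokens.

    tokens: (seq_len, hidden_dim) array of token embeddings.
    Returns an array of shape (ceil(seq_len / ratio), hidden_dim).
    """
    seq_len, dim = tokens.shape
    pad = (-seq_len) % ratio  # zero-pad so seq_len divides evenly
    if pad:
        tokens = np.concatenate([tokens, np.zeros((pad, dim))], axis=0)
    # group adjacent tokens and average each group into one token
    return tokens.reshape(-1, ratio, dim).mean(axis=1)

# A 1024-token visual sequence compressed 4x down to 256 tokens
vis = np.random.rand(1024, 64)
out = compress_tokens(vis, ratio=4)
print(out.shape)  # (256, 64)
```

Any such scheme trades fidelity for throughput: fewer tokens means less work per layer, which is where the reported speedup on long documents and videos would come from.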
📝 Abstract
We introduce Nemotron Nano V2 VL, the latest model in the Nemotron vision-language series, designed for strong real-world document understanding, long-video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and employs innovative token reduction techniques to achieve higher inference throughput in long-document and video scenarios. We are releasing model checkpoints in BF16, FP8, and FP4 formats and sharing large parts of our datasets, recipes, and training code.