🤖 AI Summary
This work presents Nemotron 3 Nano Omni, an efficient, open large language model that natively accepts four input modalities: text, image, video, and audio. Built on the efficient Nemotron 3 Nano 30B-A3B backbone, it is the first model in the Nemotron multimodal series to support audio input natively. It also adds multimodal token-reduction techniques that lower inference latency and raise throughput relative to other models of similar size, while preserving strong cross-modal understanding. Model checkpoints are released in BF16, FP8, and FP4 formats, together with portions of the training data and codebase. Evaluations show consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, with leading results in real-world document understanding, long audio-video comprehension, and agentic computer use.
📝 Abstract
We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data, and recipes. In particular, Nemotron 3 Nano Omni delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase, to facilitate further research and development.