NVIDIA Nemotron Parse 1.1

📅 2025-11-25
🤖 AI Summary
To address challenges in vision-intensive document parsing, including low OCR accuracy, difficult structural reconstruction (e.g., Markdown, tables, charts), and output-length limits on visually dense pages, this paper introduces Nemotron-Parse-1.1, a lightweight encoder-decoder model for document parsing and OCR. The model has 885M parameters in total, including a compact 256M-parameter language decoder, and predicts the bounding boxes and semantic classes of text segments alongside the recognized text, covering text recognition, layout understanding, and format restoration. Compared with its predecessor, Nemoretriever-Parse-1.0, it supports a longer output sequence length. A second variant, Nemotron-Parse-1.1-TC, operates on a reduced vision token length and is 20% faster with minimal quality degradation. Nemotron-Parse-1.1 achieves competitive accuracy on public benchmarks. The authors release the model weights on Hugging Face, an optimized NIM container, and a subset of the training data as part of the broader Nemotron-VLM-v2 dataset.

📝 Abstract
We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, Markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts the bounding boxes of text segments together with their corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks, making it a strong lightweight OCR solution. We release the model weights publicly on Hugging Face, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC, which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.
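The abstract says the model returns text segments with bounding boxes and semantic classes. As a rough illustration of how downstream code might consume such output, the sketch below defines a hypothetical `ParsedElement` record and a naive Markdown renderer. The schema, the class labels ("Title", "Text"), the normalized coordinates, and the function names are all assumptions made for this example; they are not the model's documented output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedElement:
    """One parsed region. Hypothetical schema, not the model's actual output."""
    text: str
    semantic_class: str  # assumed labels such as "Title" or "Text"
    x0: float            # assumed normalized page coordinates in [0, 1]
    y0: float
    x1: float
    y1: float


def reading_order(elements):
    """Sort top-to-bottom, then left-to-right (naive single-column assumption)."""
    return sorted(elements, key=lambda e: (e.y0, e.x0))


def to_markdown(elements):
    """Render elements as Markdown, promoting "Title" regions to headings."""
    lines = []
    for element in reading_order(elements):
        if element.semantic_class == "Title":
            lines.append(f"# {element.text}")
        else:
            lines.append(element.text)
    return "\n\n".join(lines)


# Two regions listed out of order; the title sits higher on the page.
page = [
    ParsedElement("Body paragraph.", "Text", 0.1, 0.30, 0.9, 0.40),
    ParsedElement("Quarterly Report", "Title", 0.1, 0.05, 0.9, 0.12),
]
markdown = to_markdown(page)
print(markdown)
```

Sorting by the top-left corner is only adequate for single-column pages; multi-column layouts would need a real reading-order step.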
Problem

Research questions and friction points this paper is trying to address.

Improving document parsing and OCR across general text, Markdown formatting, tables, pictures, charts, and diagrams
Extracting text segments together with their bounding boxes and semantic classes
Providing a lightweight OCR solution with competitive accuracy and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight document parsing and OCR model
Encoder-decoder architecture with 885M parameters
Improved OCR, markdown formatting, and table parsing

Authors

Kateryna Chumachenko (Research Scientist, NVIDIA)
Amala Sanjay Deshmukh (NVIDIA)
Jarno Seppanen (NVIDIA)
Ilia Karmanov (NVIDIA); interests: Computer Vision
Chia-Chih Chen (NVIDIA)
Lukas Voegtle (NVIDIA)
Philipp Fischer (University of Freiburg); interests: Computer Vision
Marek Wawrzos (NVIDIA)
Saeid Motiian (NVIDIA)
Roman Ageev (NVIDIA)
Kedi Wu (NVIDIA)
Alexandre Milesi (NVIDIA)
Maryam Moosaei (NVIDIA)
Krzysztof Pawelec (NVIDIA)
Padmavathy Subramanian (NVIDIA)
M. Samadi (NVIDIA)
Xin Yu (NVIDIA)
Celina Dear (NVIDIA)
Sarah Stoddard (NVIDIA)
Jenna Diamond (NVIDIA)
J. Oliver (NVIDIA)
Leanna Chraghchian (NVIDIA)
Patrick J. Skelly (NVIDIA)
Tom Balough (NVIDIA)
Yao Xu (NVIDIA)
Jane Polak Scowcroft (NVIDIA)
Daniel Korzekwa (NVIDIA); interests: Pruning, Distillation, LLM, VLM, Speech
Darragh Hanley (NVIDIA)
Sandip Bhaskar (NVIDIA)
Timo Roman (NVIDIA)
Karan Sapra (Clemson University, NVIDIA); interests: Deep Learning, High Performance Computing, Image Processing, Genomics, Coexpression Networks
Andrew Tao (NVIDIA); interests: Computer Vision, Machine Learning
Bryan Catanzaro (NVIDIA); interests: Parallel Computing, Machine Learning