NVIDIA Nemotron Parse 1.1

📅 2025-11-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper introduces Nemotron-Parse-1.1, a lightweight encoder-decoder model for document parsing and OCR and the successor to Nemoretriever-Parse-1.0. It targets general OCR, Markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams, and it supports a longer output sequence length for visually dense documents. Like its predecessor, it outputs bounding boxes for text segments together with their semantic classes. The model has 885M parameters in total, of which 256M belong to the language decoder. A second variant, Nemotron-Parse-1.1-TC, operates on a reduced vision token length and runs about 20% faster with minimal quality degradation. The authors report competitive accuracy on public benchmarks and release the model weights on Hugging Face, an optimized NIM container, and a subset of the training data as part of the broader Nemotron-VLM-v2 dataset.

📝 Abstract

We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, Markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as the corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks, making it a strong lightweight OCR solution. We release the model weights publicly on Hugging Face, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC, which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.

Problem

Research questions and friction points this paper is trying to address.

Improves document parsing and OCR capabilities for various content types

Extracts text and semantic classes with bounding boxes from documents

Provides lightweight OCR solution with competitive accuracy and efficiency
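
To make the second point above concrete, the sketch below shows one way a downstream program might hold text segments with their bounding boxes and semantic classes and turn them into Markdown. It is an illustration only: the `ParsedElement` type, the class labels `"Title"` and `"Text"`, the normalized `(x_min, y_min, x_max, y_max)` box convention, and the top-to-bottom ordering heuristic are all assumptions made for this example, not the model's actual output schema or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ParsedElement:
    """Hypothetical container for one text segment (not the model's real schema)."""
    text: str
    semantic_class: str  # illustrative labels such as "Title" or "Text"
    # Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
    bbox: Tuple[float, float, float, float]


def reading_order(elements: List[ParsedElement]) -> List[ParsedElement]:
    """Naive ordering assumed for illustration: top-to-bottom, then left-to-right."""
    return sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))


def to_markdown(elements: List[ParsedElement]) -> str:
    """Render segments as Markdown, promoting assumed "Title" segments to headings."""
    lines = []
    for element in reading_order(elements):
        prefix = "# " if element.semantic_class == "Title" else ""
        lines.append(prefix + element.text)
    return "\n\n".join(lines)


if __name__ == "__main__":
    page = [
        ParsedElement("Body text.", "Text", (0.1, 0.30, 0.9, 0.50)),
        ParsedElement("Report", "Title", (0.1, 0.05, 0.9, 0.15)),
    ]
    print(to_markdown(page))
```

Keeping the box and the class next to each text segment lets later steps, such as cropping a region or filtering by class, work without re-running the parser.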

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight document parsing and OCR model

Encoder-decoder architecture with 885M parameters

Improved OCR, markdown formatting, and table parsing
