🤖 AI Summary
To address challenges in parsing visually dense documents, including limited OCR accuracy, difficult structural reconstruction (Markdown formatting, tables, and text in pictures, charts, and diagrams), and output-length limits for long sequences, this paper introduces Nemotron-Parse-1.1, a lightweight encoder-decoder model that succeeds Nemoretriever-Parse-1.0. The model has 885M parameters in total, including a compact 256M-parameter language decoder, and alongside the recognized text it extracts bounding boxes of text segments together with their semantic classes. It also supports a longer output sequence length for visually dense documents. A companion variant, Nemotron-Parse-1.1-TC, operates on a reduced vision token length and offers a 20% speed improvement with minimal quality degradation. On public benchmarks, Nemotron-Parse-1.1 achieves competitive accuracy for a lightweight OCR model. The authors release the model weights on Hugging Face, an optimized NIM container, and a subset of the training data as part of the broader Nemotron-VLM-v2 dataset.
📝 Abstract
We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, Markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as their corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks, making it a strong lightweight OCR solution. We release the model weights publicly on Hugging Face, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC, which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.