🤖 AI Summary
This work addresses enterprise-level Visual Document Understanding (VDU): extracting information from, and reasoning over, heterogeneous visual document formats such as tables, charts, diagrams, infographics, and sketches. It introduces Granite Vision, a lightweight model that aligns the visual modality with a decoder-only, 2B-parameter Granite large language model. The model is trained on an instruction-following dataset that covers document-related tasks as well as general image tasks, and it adds a test-time safety classification mechanism that uses a sparse set of attention vectors to flag potentially harmful inputs. Granite Vision achieves strong results on standard VDU benchmarks and on LiveXiv, a benchmark that resists test-set contamination by drawing on a constantly updated corpus of recent arXiv papers. The model is released under the Apache 2.0 license for research and commercial use, with model weights on Hugging Face and full visibility into the training data.
📝 Abstract
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly visual document understanding. Our model is trained on a comprehensive instruction-following dataset that includes document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered on aligning the visual modality with a decoder-only, 2-billion-parameter Granite large language model. Additionally, we introduce a dedicated test-time safety classification approach that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks for visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache 2.0 license, allowing both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
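The abstract does not say how the sparse set of attention vectors is chosen or how it is turned into a safety decision. As a rough illustration only, the sketch below assumes that each input yields one feature vector per attention head, that heads are ranked by nearest-centroid accuracy on a small labelled set, and that the final label is a majority vote over the selected heads. The function names, the synthetic data, and the selection rule are all hypothetical and are not taken from the paper.

```python
import numpy as np


def select_sparse_heads(features, labels, k):
    """Rank heads by nearest-centroid accuracy and keep the best k.

    features: (n_examples, n_heads, dim) array, one vector per head (assumed layout).
    labels:   (n_examples,) array with 0 = benign, 1 = harmful.
    """
    n_heads = features.shape[1]
    scores = np.empty(n_heads)
    for j in range(n_heads):
        x = features[:, j, :]
        c0 = x[labels == 0].mean(axis=0)
        c1 = x[labels == 1].mean(axis=0)
        closer_to_harmful = (
            np.linalg.norm(x - c1, axis=1) < np.linalg.norm(x - c0, axis=1)
        )
        scores[j] = (closer_to_harmful.astype(int) == labels).mean()
    return np.argsort(-scores)[:k]


def fit_centroids(features, labels, heads):
    """Per-class centroids for the selected heads; each has shape (k, dim)."""
    x = features[:, heads, :]
    return x[labels == 0].mean(axis=0), x[labels == 1].mean(axis=0)


def classify(feature, heads, centroids):
    """Majority vote over the selected heads; returns 1 for harmful, 0 for benign."""
    c0, c1 = centroids
    x = feature[heads]
    votes = np.linalg.norm(x - c1, axis=1) < np.linalg.norm(x - c0, axis=1)
    return int(votes.sum() * 2 > len(heads))


# Synthetic stand-in for real attention vectors: only three heads carry signal.
rng = np.random.default_rng(0)
n_examples, n_heads, dim = 200, 16, 8
informative = [2, 7, 11]  # hypothetical "safety-relevant" heads

labels = rng.integers(0, 2, size=n_examples)
features = rng.normal(size=(n_examples, n_heads, dim))
features[:, informative, :] += 3.0 * labels[:, None, None]

heads = select_sparse_heads(features, labels, k=3)
centroids = fit_centroids(features, labels, heads)

harmful_example = rng.normal(size=(n_heads, dim))
harmful_example[informative] += 3.0
benign_example = rng.normal(size=(n_heads, dim))

print("selected heads:", sorted(heads.tolist()))
print("harmful example ->", classify(harmful_example, heads, centroids))
print("benign example  ->", classify(benign_example, heads, centroids))
```

Under these assumptions, the classifier reads only a handful of heads at inference time, which is one plausible reading of "sparse" in the abstract. The real method may select and combine vectors differently.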