Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

📅 2025-02-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses enterprise Visual Document Understanding (VDU): extracting information from, and reasoning over, heterogeneous visual document formats, including tables, charts, diagrams, infographics, and sketches. The model aligns the visual modality with a decoder-only, 2B-parameter Granite large language model, is trained on an instruction-following dataset covering document-related and general image tasks, and adds a test-time safety classification mechanism that uses a sparse set of attention vectors to flag potentially harmful inputs. It achieves strong results on standard VDU benchmarks and on the contamination-resistant LiveXiv benchmark despite its small size. The model is released under the Apache-2.0 license, with full visibility into the training data and other relevant details.

📝 Abstract

We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly visual document understanding. Our model is trained on a comprehensive instruction-following dataset that includes document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered on visual modality alignment with a decoder-only, 2-billion-parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach at test time that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks for visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test-set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache-2.0 license, allowing both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
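
The abstract does not spell out how the safety classifier uses its sparse set of attention vectors, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each input has already been reduced to one feature vector per attention head (how those vectors are extracted from the model is not shown), keeps the `k` heads whose nearest-centroid predictions best separate a small labeled set, and classifies a new input by majority vote over those heads. All function names and the toy data are hypothetical.

```python
from collections import Counter


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(vec, centroids):
    """Label whose class centroid is closest to vec."""
    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))


def fit_sparse_heads(examples, labels, k):
    """Select the k most discriminative heads (assumed selection rule).

    examples[i][h] is the feature vector of head h for labeled example i.
    Returns a list of (head_index, class_centroids) pairs.
    """
    scored = []
    for h in range(len(examples[0])):
        cents = {
            lab: centroid([ex[h] for ex, l in zip(examples, labels) if l == lab])
            for lab in sorted(set(labels))
        }
        hits = sum(nearest_label(ex[h], cents) == l for ex, l in zip(examples, labels))
        scored.append((hits / len(labels), h, cents))
    # Highest training accuracy first; ties broken by lower head index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(h, cents) for _, h, cents in scored[:k]]


def classify(example, heads):
    """Majority vote of the selected heads' nearest-centroid predictions."""
    votes = Counter(nearest_label(example[h], cents) for h, cents in heads)
    return votes.most_common(1)[0][0]


# Toy data: 3 heads with 2-dimensional vectors; head 1 carries no signal.
examples = [
    [[0.0, 0.1], [0.5, 0.5], [1.0, 0.0]],
    [[0.1, 0.0], [0.4, 0.6], [0.9, 0.1]],
    [[1.0, 0.9], [0.5, 0.5], [0.0, 1.0]],
    [[0.9, 1.0], [0.6, 0.4], [0.1, 0.9]],
]
labels = ["safe", "safe", "harmful", "harmful"]

heads = fit_sparse_heads(examples, labels, k=2)
prediction = classify([[0.95, 0.95], [0.5, 0.5], [0.05, 0.95]], heads)
```

In this sketch only the selected heads are consulted at test time, which keeps the classifier cheap relative to the forward pass that produces the vectors; the selection criterion and the voting rule are design assumptions made for illustration.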

Problem

Research questions and friction points this paper is trying to address.

Develops lightweight multimodal model

Enhances enterprise visual document understanding

Ensures safety in model inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight multimodal model

Visual document understanding

Safety classification approach
