DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive memory and computational overhead of large vision-language models (LVLMs) when deployed on edge devices, this paper proposes DocSLM—a lightweight long-document understanding model. Methodologically, DocSLM introduces (1) a hierarchical multimodal compressor that jointly encodes visual, textual, and layout modalities to preserve both local details and global semantics; and (2) an entropy-based streaming token pruning mechanism that dynamically discards redundant visual tokens while calibrating response uncertainty. Evaluated on multiple long multimodal document benchmarks, DocSLM achieves state-of-the-art performance using only a fraction of visual tokens and model parameters: it reduces visual tokens by 82%, model parameters by 75%, and end-to-end latency by 71%. To our knowledge, DocSLM is the first framework enabling efficient and robust long-document understanding on resource-constrained edge devices.
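The paper does not include implementation details here, but the core idea of the compressor, squeezing each page's variable-length multimodal token sequence into a fixed-length one, can be sketched roughly as follows. This is a toy mean-pooling stand-in under assumed names and shapes; the actual model uses a learned hierarchical module.

```python
import numpy as np

def compress_page_tokens(page_tokens: np.ndarray, target_len: int = 64) -> np.ndarray:
    """Toy stand-in for a hierarchical multimodal compressor: pool a
    variable-length token sequence of shape (n_tokens, dim) down to a
    fixed-length (target_len, dim) by mean-pooling contiguous chunks.
    Assumes n_tokens >= target_len; DocSLM's real compressor is learned."""
    n, _ = page_tokens.shape
    # Split token indices into target_len nearly equal contiguous chunks,
    # then average each chunk into a single summary token.
    chunks = np.array_split(np.arange(n), target_len)
    return np.stack([page_tokens[idx].mean(axis=0) for idx in chunks])

# Example: a page with 1000 tokens of dim 16 becomes a fixed 64-token sequence,
# so memory per page no longer grows with page complexity.
page = np.random.default_rng(0).normal(size=(1000, 16))
compressed = compress_page_tokens(page, target_len=64)
print(compressed.shape)  # (64, 16)
```

Because every page maps to the same fixed length, memory scales linearly and predictably with page count rather than with per-page token density.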

📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
Problem

Research questions and friction points this paper is trying to address.

Reducing memory footprint for multimodal document understanding
Enabling efficient processing on resource-constrained edge devices
Maintaining performance while compressing visual and textual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Multimodal Compressor jointly encodes visual, textual, and layout information
Streaming Abstention mechanism processes document segments sequentially
Reduces visual tokens, parameters, and latency for edge devices
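The Streaming Abstention idea described above, filtering low-confidence responses with an entropy-based uncertainty calibrator, can be illustrated with a minimal sketch. The aggregation (mean per-token entropy) and the threshold value here are assumptions for illustration, not the paper's calibrated settings.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def abstain(per_token_probs, threshold=1.0):
    """Toy uncertainty calibrator: average the per-token entropies of a
    candidate answer and abstain when the mean entropy exceeds a threshold.
    A streaming system would apply this check per document segment."""
    mean_h = sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)
    return mean_h > threshold

# Confident answer: near-one-hot distributions -> low entropy, keep it.
confident = [[0.97, 0.01, 0.01, 0.01], [0.95, 0.02, 0.02, 0.01]]
# Uncertain answer: near-uniform distributions -> high entropy, abstain.
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.3, 0.3, 0.2, 0.2]]
print(abstain(confident))  # False
print(abstain(uncertain))  # True
```

Filtering segments this way lets the model process arbitrarily long documents in pieces while suppressing answers it is not confident about, which is the robustness claim behind the abstention mechanism.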