DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive memory and computational overhead of large vision-language models (LVLMs) when deployed on edge devices, this paper proposes DocSLM—a lightweight long-document understanding model. Methodologically, DocSLM introduces (1) a hierarchical multimodal compressor that jointly encodes visual, textual, and layout modalities to preserve both local details and global semantics; and (2) an entropy-based streaming token pruning mechanism that dynamically discards redundant visual tokens while calibrating response uncertainty. Evaluated on multiple long multimodal document benchmarks, DocSLM achieves state-of-the-art performance using only a fraction of visual tokens and model parameters: it reduces visual tokens by 82%, model parameters by 75%, and end-to-end latency by 71%. To our knowledge, DocSLM is the first framework enabling efficient and robust long-document understanding on resource-constrained edge devices.
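The paper does not include implementation details here, but the core idea of the compressor, squeezing each page's variable-length multimodal token sequence into a fixed-length one, can be sketched roughly as follows. This is a toy mean-pooling stand-in under assumed names and shapes; the actual model uses a learned hierarchical module.

```python
import numpy as np

def compress_page_tokens(page_tokens: np.ndarray, target_len: int = 64) -> np.ndarray:
    """Toy stand-in for a hierarchical multimodal compressor: pool a
    variable-length token sequence of shape (n_tokens, dim) down to a
    fixed-length (target_len, dim) by mean-pooling contiguous chunks.
    Assumes n_tokens >= target_len; DocSLM's real compressor is learned."""
    n, _ = page_tokens.shape
    # Split token indices into target_len nearly equal contiguous chunks,
    # then average each chunk into a single summary token.
    chunks = np.array_split(np.arange(n), target_len)
    return np.stack([page_tokens[idx].mean(axis=0) for idx in chunks])

# Example: a page with 1000 tokens of dim 16 becomes a fixed 64-token sequence,
# so memory per page no longer grows with page complexity.
page = np.random.default_rng(0).normal(size=(1000, 16))
compressed = compress_page_tokens(page, target_len=64)
print(compressed.shape)  # (64, 16)
```

Because every page maps to the same fixed length, memory scales linearly and predictably with page count rather than with per-page token density.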

📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
Problem

Research questions and friction points this paper is trying to address.

Reducing memory footprint for multimodal document understanding
Enabling efficient processing on resource-constrained edge devices
Maintaining performance while compressing visual and textual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Multimodal Compressor jointly encodes visual, textual, and layout information
Streaming Abstention mechanism processes document segments sequentially
Reduces visual tokens, parameters, and latency for edge devices
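The Streaming Abstention idea described above, filtering low-confidence responses with an entropy-based uncertainty calibrator, can be illustrated with a minimal sketch. The aggregation (mean per-token entropy) and the threshold value here are assumptions for illustration, not the paper's calibrated settings.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def abstain(per_token_probs, threshold=1.0):
    """Toy uncertainty calibrator: average the per-token entropies of a
    candidate answer and abstain when the mean entropy exceeds a threshold.
    A streaming system would apply this check per document segment."""
    mean_h = sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)
    return mean_h > threshold

# Confident answer: near-one-hot distributions -> low entropy, keep it.
confident = [[0.97, 0.01, 0.01, 0.01], [0.95, 0.02, 0.02, 0.01]]
# Uncertain answer: near-uniform distributions -> high entropy, abstain.
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.3, 0.3, 0.2, 0.2]]
print(abstain(confident))  # False
print(abstain(uncertain))  # True
```

Filtering segments this way lets the model process arbitrarily long documents in pieces while suppressing answers it is not confident about, which is the robustness claim behind the abstention mechanism.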