🤖 AI Summary
To address the excessive memory and computational overhead of large vision-language models (LVLMs) when deployed on edge devices, this paper proposes DocSLM, a lightweight model for long-document understanding. Methodologically, DocSLM introduces (1) a hierarchical multimodal compressor that jointly encodes the visual, textual, and layout modalities of each page into a fixed-length sequence, preserving both local details and global semantics; and (2) a streaming abstention mechanism that processes document segments sequentially and filters low-confidence responses with an entropy-based uncertainty calibrator, enabling scalable handling of arbitrarily long inputs. Evaluated on multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer model parameters, and 71% lower end-to-end latency. To the authors' knowledge, DocSLM is the first framework to enable efficient and reliable long-document understanding on resource-constrained edge devices.
📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
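The abstract does not spell out how the entropy-based uncertainty calibrator decides when to abstain. A minimal sketch of the general idea, assuming the common recipe of thresholding the mean Shannon entropy of the model's per-token output distributions (the function names and the threshold value are illustrative, not from the paper):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_abstain(token_distributions, threshold=1.0):
    """Abstain (filter out the response) when the mean per-token entropy
    exceeds a calibrated threshold, i.e. the model is too uncertain.
    `threshold` here is a placeholder, not a value from the paper."""
    mean_h = sum(token_entropy(d) for d in token_distributions) / len(token_distributions)
    return mean_h > threshold

# Confident response: probability mass concentrated on one token -> keep.
confident = [[0.97, 0.01, 0.01, 0.01]]
# Uncertain response: near-uniform over 4 tokens (entropy ~= ln 4 ~= 1.39) -> abstain.
uncertain = [[0.25, 0.25, 0.25, 0.25]]
```

In a streaming setting, such a check would run per document segment, so low-confidence segment answers can be discarded before aggregation rather than after processing the whole document.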