FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of vision-language models on document OCR, which stems from the large number of visual tokens needed to encode dense pages; static pruning is not a remedy, because it risks permanently discarding critical information. The authors propose FastOCR, described as the first framework to exploit dynamic visual fixation for OCR acceleration. It employs a training-free dynamic pruning strategy over the KV cache that, at each decoding step, selects a small local set of key visual tokens for attention computation without evicting anything from the cache. Focal-Guided Pruning identifies a small set of focal layers and picks the most task-relevant visual tokens from them, while Cross-Step Fixation Reuse warm-starts each decoding step from the region attended at the previous one. The method is evaluated on five vision-language models of varying sizes and architectures; on Qwen2.5-VL it retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3×.

📝 Abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.
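
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: dynamic selection over an intact KV cache. It is not the authors' algorithm. Every name here (`select_fixation`, `sparse_attention`, `keep_ratio`, `window`) is hypothetical, the relevance scores are assumed to come from some focal layer, and the "reuse" step is modeled, as an assumption, by restricting the search to a neighborhood of the previously kept tokens.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_fixation(scores, keep_ratio=0.05, prev_kept=None, window=1):
    """Choose which visual tokens to attend to at one decoding step.

    scores    : relevance of each visual token (assumed to be read from a
                focal layer; how the paper computes it is not stated here).
    prev_kept : indices kept at the previous step. If given, only tokens
                within `window` positions of them are considered, which is
                one assumed way to warm-start from the previous fixation.
    """
    n = len(scores)
    budget = max(1, round(keep_ratio * n))
    if prev_kept is None:
        candidates = range(n)
    else:
        candidates = sorted(
            {j for i in prev_kept
             for j in range(max(0, i - window), min(n, i + window + 1))}
        )
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)


def sparse_attention(query, keys, values, kept):
    """Scaled dot-product attention restricted to the `kept` indices.

    `keys` and `values` (the cache) are only read, never modified, so a
    token skipped at this step can still be selected at a later one.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in kept]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, kept))
            for d in range(dim)]
```

Because the cache itself is never shrunk, the selection can move freely from step to step. A real system would also need a fallback for abrupt jumps in fixation (for example at a line break), which this neighborhood-only sketch does not handle.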

Problem

Research questions and friction points this paper is trying to address.

Optical Character Recognition

Vision-Language Models

KV Cache Pruning

Visual Token

Document Parsing

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV Cache Pruning

Dynamic Visual Fixation

Vision-Language Models