🤖 AI Summary
This work addresses the memory and latency bottlenecks that high-resolution visual tokens cause in vision-language models, which existing post-hoc pruning methods fail to relieve in practice. The authors propose CIVIC, a framework that, for the first time, keeps a compact sequence representation end to end across the vision encoder, projection layer, LLM prefill phase, and KV cache, thereby avoiding non-contiguous memory access and local unmerging overhead. Using text-aligned KL-divergence distillation and an adaptive spatial retention floor, CIVIC preserves geometric structure, fine-grained localization, and semantic integrity under compression. Evaluated on Qwen3-VL, CIVIC reduces KV cache memory to approximately one-third of the baseline and lowers end-to-end latency, without degrading accuracy on multimodal reasoning and visual grounding tasks.
📝 Abstract
Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. Current token reduction methods save FLOPs in theory, but post-hoc pruning introduces structural overhead and so fails to yield proportional wall-clock speedups. Enforcing a contiguous compact pathway, on the other hand, risks geometric disorientation and loss of fine-grained localization. To address both problems, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overhead. Evaluated on the Qwen3-VL architecture, CIVIC translates sequence reduction into real hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these gains without degrading accuracy on multimodal reasoning and visual grounding benchmarks.
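To make the KV-cache claim concrete, the following minimal sketch shows why a shorter sequence translates directly into a smaller cache: for a standard decoder, cache size grows linearly with the number of cached tokens. The model dimensions and token counts below are hypothetical placeholders chosen for illustration only; they are not taken from the paper or from the Qwen3-VL configuration.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both a key and a value
    tensor per layer; bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical decoder configuration (assumption, not from the paper).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Hypothetical token counts (assumption, not from the paper).
TEXT_TOKENS = 100
VISUAL_TOKENS_FULL = 3000      # uncompressed visual sequence
VISUAL_TOKENS_COMPACT = 1000   # compact visual sequence kept end to end

full = kv_cache_bytes(TEXT_TOKENS + VISUAL_TOKENS_FULL,
                      NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
compact = kv_cache_bytes(TEXT_TOKENS + VISUAL_TOKENS_COMPACT,
                         NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
ratio = compact / full

print(f"full cache:    {full / 2**20:.1f} MiB")
print(f"compact cache: {compact / 2**20:.1f} MiB")
print(f"compact / full = {ratio:.3f}")
```

Because the per-token cost is constant, the memory ratio equals the ratio of total sequence lengths. The overall saving therefore depends on both how many visual tokens are retained and how large the text share of the sequence is.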