CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory and latency bottlenecks that high-resolution visual tokens cause in vision-language models. Existing post-hoc pruning methods save FLOPs in theory but add structural overhead, so they do not deliver proportional wall-clock speedups. The authors propose CIVIC, a path-consistent framework that keeps a compact sequence representation across the vision encoder, projection layer, LLM prefill phase, and KV cache, thereby avoiding non-contiguous memory access and localized unmerging overhead. Text-aligned KL distillation and an adaptive spatial retention floor counter the geometric disorientation and loss of fine-grained localization that a contiguous compact pathway would otherwise risk. Evaluated on Qwen3-VL, CIVIC reduces KV cache memory to approximately one-third of the baseline and lowers end-to-end inference latency without degrading accuracy on multimodal reasoning and visual grounding benchmarks.

📝 Abstract

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.
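To make the "approximately one-third" KV-cache figure concrete, the sketch below works through the standard KV-cache size arithmetic for a decoder-only transformer. It is an illustration only: the layer count, head dimensions, token counts, and the helper name `kv_cache_bytes` are hypothetical assumptions, not Qwen3-VL's configuration and not values reported by the paper. It shows that cache size scales linearly with sequence length, so a shorter visual sequence carried through prefill shrinks the cache by the same factor.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor of 2: every layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical model dimensions (assumptions, not Qwen3-VL's real config).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Hypothetical sequence lengths (assumptions, not results from the paper).
TEXT_TOKENS = 104
VISUAL_TOKENS = 64 * 64          # uncompressed 64x64 patch grid
COMPACT_VISUAL_TOKENS = 36 * 36  # assumed compact grid after reduction

baseline = kv_cache_bytes(TEXT_TOKENS + VISUAL_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
compact = kv_cache_bytes(TEXT_TOKENS + COMPACT_VISUAL_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
ratio = compact / baseline

print(f"baseline: {baseline / 2**20:.0f} MiB")
print(f"compact:  {compact / 2**20:.0f} MiB")
print(f"ratio:    {ratio:.3f}")
```

Under these assumed numbers the total sequence drops from 4200 to 1400 tokens, so the cache drops to one-third. The saving appears in practice only if the compact sequence is what is actually written to the cache, which is the path-consistency property the abstract emphasizes.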

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

token reduction

memory bottleneck

inference latency

sequence compactness

Innovation

Methods, ideas, or system contributions that make the work stand out.

sequence compactness

vision-language models

KV-cache compression