🤖 AI Summary
To address visual token redundancy in Visual Large Language Models (VLLMs), which inflates inference cost and hinders fine-grained visual understanding, this paper proposes WiCo, a sliding-window visual token concatenation method that adaptively adjusts the visual tokens so that those within each window share similar features, preserving spatial structure while compressing the token count. The authors further introduce WiCo+, which decomposes the visual tokens in later LLM layers to enhance fine-grained semantic modeling. The approach combines fine-tuning of the vision encoder's last few layers with token decomposition in later LLM layers, and is implemented and evaluated on the LLaVA-1.5 and Shikra frameworks. Experiments show that WiCo and WiCo+ consistently outperform existing token reduction methods on both coarse- and fine-grained visual understanding tasks, reducing GPU memory consumption by up to 42% and inference latency by up to 38%. The implementation is publicly available.
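As a rough illustration of the core idea, the sketch below groups a square grid of visual tokens into non-overlapping k×k windows and concatenates each window's tokens along the channel dimension before projecting to the LLM hidden size. The window size, the non-overlapping stride, and the single-linear projector are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class WindowTokenConcat(nn.Module):
    """Minimal sketch of WiCo-style window token concatenation.

    Tokens inside each k x k spatial window are concatenated along the
    channel dimension and projected to the LLM hidden size, reducing
    the visual token count by a factor of k * k.
    """

    def __init__(self, vis_dim: int, llm_dim: int, k: int = 2):
        super().__init__()
        self.k = k
        # Hypothetical projector: a single linear layer over the
        # concatenated window features.
        self.proj = nn.Linear(vis_dim * k * k, llm_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, H*W, C) patch features, assumed to form a square grid.
        B, N, C = tokens.shape
        H = W = int(N ** 0.5)
        k = self.k
        x = tokens.view(B, H, W, C)
        # Split the grid into non-overlapping k x k windows and flatten
        # each window's k*k tokens into one long feature vector.
        x = x.view(B, H // k, k, W // k, k, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // k) * (W // k), k * k * C)
        return self.proj(x)  # (B, N / k^2, llm_dim)

# e.g., 24x24 = 576 CLIP ViT-L/14 patch tokens compressed to 144 tokens
feats = torch.randn(1, 576, 1024)
wico = WindowTokenConcat(vis_dim=1024, llm_dim=4096, k=2)
print(wico(feats).shape)  # torch.Size([1, 144, 4096])
```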
📝 Abstract
To effectively reduce the number of visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one, thereby obscuring fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging those within the same window to exhibit similar features. To further enhance performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. This design enjoys the large receptive field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance than existing token reduction projectors. The code is available at https://github.com/JackYFL/WiCo.
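One way to picture the WiCo+ decomposition is a small head that, at a chosen later LLM layer, expands each compressed visual token back into k² sub-tokens and splices them into the sequence. The linear expansion head, the split layer, and the index bookkeeping below are assumptions for illustration, not the paper's exact operator.

```python
import torch
import torch.nn as nn

class TokenDecomposer(nn.Module):
    """Sketch of WiCo+-style visual token decomposition.

    Early LLM layers run on the compressed visual tokens; from a chosen
    later layer onward, each visual token is expanded back into k*k
    sub-tokens so the model regains a fine-grained visual sequence.
    """

    def __init__(self, dim: int, k: int = 2):
        super().__init__()
        self.k2 = k * k
        # Hypothetical expansion head: one linear map per compressed token.
        self.expand = nn.Linear(dim, dim * self.k2)

    def forward(self, hidden: torch.Tensor, vis_start: int, vis_end: int) -> torch.Tensor:
        # hidden: (B, T, D) hidden states entering a later LLM layer;
        # [vis_start, vis_end) marks where the compressed visual tokens sit.
        B, T, D = hidden.shape
        vis = hidden[:, vis_start:vis_end]        # (B, Nv, D)
        vis = self.expand(vis).view(B, -1, D)     # (B, Nv * k^2, D)
        # Splice the decomposed visual tokens back into the sequence.
        return torch.cat(
            [hidden[:, :vis_start], vis, hidden[:, vis_end:]], dim=1
        )

# e.g., 144 compressed visual tokens decomposed back to 576 at a later layer
h = torch.randn(1, 200, 4096)                  # text + visual hidden states
dec = TokenDecomposer(dim=4096, k=2)
print(dec(h, vis_start=5, vis_end=149).shape)  # torch.Size([1, 632, 4096])
```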