🤖 AI Summary
To address the high redundancy of visual tokens and the difficulty of jointly optimizing generative and discriminative capabilities in Large Vision-Language Models (LVLMs), this paper proposes a dual-forward bottleneck compression framework. In the first forward pass, the LVLM's LLM condenses the visual information into a small set of summary tokens, with a contrastive loss strengthening their discriminative power; in the second pass, the same LLM processes language instructions alongside the summary tokens in place of the image tokens, with an autoregressive loss providing a direct optimization objective for compression. Stage-specific adapters enable task-adaptive training at each stage. Built natively atop the LVLM architecture, this lightweight compression pipeline unifies autoregressive reconstruction, contrastive learning, and adapter-based fine-tuning. While keeping the representation nearly lossless, the method achieves a 2x higher compression rate on generative tasks without performance degradation, and it sets new state-of-the-art results on discriminative tasks, including image retrieval and compositional reasoning, yielding a unified visual token compression paradigm that excels at both generative and discriminative objectives while remaining efficient and scalable.
📝 Abstract
In this work, we aim to compress the vision tokens of a Large Vision-Language Model (LVLM) into a representation that is simultaneously (a) suitable for generative tasks, (b) suitable for discriminative tasks, (c) nearly lossless, and (d) storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot lies a "double-forward pass" training strategy: during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, which serve as a direct replacement for the image tokens. The training signal is provided by two losses: an autoregressive loss, applied after the second pass, which provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, which further strengthens the representation, especially for discriminative tasks. Training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot yields highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising generative capabilities, setting a new state of the art. For discriminative tasks, we set a new state of the art on image retrieval and compositionality.
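The double-forward pass can be sketched in toy form as follows. This is only an illustrative data-flow sketch, not the paper's implementation: `llm_forward` is a stub standing in for the LLM, and the token counts (576 image tokens condensed into 288 summary tokens, mirroring the 2x rate) are assumed for illustration.

```python
def llm_forward(tokens):
    """Stub for the LLM: returns one 'hidden state' per input token."""
    return [f"h({t})" for t in tokens]

def first_pass(image_tokens, num_summary):
    """Pass 1: condense visual information into summary tokens (the bottleneck).

    The image tokens and learnable summary slots are processed together;
    only the summary slots' hidden states are kept.
    """
    summary_slots = [f"<sum{i}>" for i in range(num_summary)]
    hidden = llm_forward(image_tokens + summary_slots)
    return hidden[-num_summary:]  # contrastive loss would be applied here

def second_pass(summary_states, instruction_tokens):
    """Pass 2: summary tokens directly replace the image tokens."""
    return llm_forward(summary_states + instruction_tokens)  # autoregressive loss here

# Assumed sizes: 576 image tokens (e.g. a 24x24 vision-encoder grid),
# compressed 2x into 288 summary tokens.
image = [f"img{i}" for i in range(576)]
summary = first_pass(image, num_summary=288)
out = second_pass(summary, ["Describe", "the", "image"])
print(len(image), len(summary))  # prints: 576 288
```

In the real method both passes run the same frozen LLM with stage-specific adapters, so the compressor and the consumer of the compressed tokens share weights.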