🤖 AI Summary
This work addresses the high computational cost of autoregressive vision-language models during inference, which stems from their reliance on a large number of visual tokens. To mitigate this, the authors propose Mask-LLaVA, a framework that introduces mask-based object representations into vision-language modeling. By integrating global, local patch, and mask-derived object tokens, Mask-LLaVA constructs a multi-granular yet compact visual representation. This design enables dynamic token pruning at inference time without retraining, allowing the token count to be reduced flexibly under computational constraints. Experimental results demonstrate that Mask-LLaVA achieves performance comparable to the original LLaVA on standard benchmarks while significantly improving inference efficiency.
📝 Abstract
Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in high compute requirements, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens, especially mask-based object tokens, at test time, adapting the token count during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.
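The token layout and test-time pruning described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token counts, feature dimension, and the simple keep-the-first-k pruning policy are all assumptions made for the example.

```python
import numpy as np


def build_visual_tokens(num_global=1, num_patch=64, num_mask=32, dim=8, seed=0):
    """Create stand-in feature tensors for the three token granularities:
    global tokens, local patch tokens, and mask-derived object tokens.
    (Random features here; a real model would produce these from an image.)"""
    rng = np.random.default_rng(seed)
    global_tokens = rng.standard_normal((num_global, dim))
    patch_tokens = rng.standard_normal((num_patch, dim))
    mask_tokens = rng.standard_normal((num_mask, dim))
    return global_tokens, patch_tokens, mask_tokens


def prune_for_inference(global_tokens, patch_tokens, mask_tokens, keep_masks):
    """Concatenate the multi-granular tokens, dropping mask-based object
    tokens down to `keep_masks` at test time. Global and patch tokens are
    kept, so the same trained model runs with a smaller token budget and
    no retraining. (Illustrative policy: keep the first k mask tokens.)"""
    return np.concatenate(
        [global_tokens, patch_tokens, mask_tokens[:keep_masks]], axis=0
    )


g, p, m = build_visual_tokens()
full = prune_for_inference(g, p, m, keep_masks=m.shape[0])  # 1 + 64 + 32 = 97 tokens
compact = prune_for_inference(g, p, m, keep_masks=8)        # 1 + 64 + 8 = 73 tokens
print(full.shape, compact.shape)
```

The key property the sketch shows is that the token budget becomes an inference-time knob: shrinking `keep_masks` shortens the visual token sequence fed to the language model without touching any weights.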