🤖 AI Summary
Existing visual token compression methods for Large Vision-Language Models (LVLMs) suffer from insufficient high-level semantic modeling, leading to either redundant information or contextual loss—particularly problematic given the quadratic growth in visual tokens with high-resolution images, which severely increases computational and memory overhead. To address this, we propose CORE, an object-centric visual token compression paradigm for LVLMs. CORE introduces segmentation-mask-guided semantic priors to inform token clustering and designs a centroid-guided reordering mechanism to preserve spatial structure, ensuring both semantic integrity and geometric consistency. It employs a lightweight segmentation decoder to generate object masks, yielding compact, interpretable, object-centered representations. Evaluated on six authoritative benchmarks, CORE achieves state-of-the-art performance at fixed compression ratios. Under adaptive compression, retaining only 2.2% of original tokens maintains 97.4% of baseline accuracy—demonstrating substantial gains in efficiency and representation quality.
📝 Abstract
Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, aggressively retaining only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.
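The core pipeline described above — average-merging the visual tokens covered by each object mask, then reordering the merged tokens by mask centroid — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `core_compress`, the mean-pooling merge, and the raster-order centroid key are all assumptions for exposition.

```python
import numpy as np

def core_compress(tokens, masks):
    """Hypothetical sketch of mask-guided token merging with
    centroid-guided reordering (not the paper's actual code).

    tokens: (H*W, D) visual tokens, flattened in row-major (raster) order.
    masks:  (K, H, W) boolean object masks from a segmentation decoder.
    Returns up to (K, D) object-centric tokens, sorted top-to-bottom,
    left-to-right by mask centroid to preserve spatial order.
    """
    K, H, W = masks.shape
    flat = masks.reshape(K, -1)            # (K, H*W) mask over token grid
    ys, xs = np.mgrid[0:H, 0:W]            # per-token grid coordinates
    merged, keys = [], []
    for k in range(K):
        idx = flat[k].nonzero()[0]         # token indices inside this mask
        if idx.size == 0:
            continue                       # skip empty masks
        merged.append(tokens[idx].mean(axis=0))          # merge by average pooling
        cy = ys.reshape(-1)[idx].mean()                  # centroid row
        cx = xs.reshape(-1)[idx].mean()                  # centroid column
        keys.append(cy * W + cx)                         # raster-order sort key
    order = np.argsort(keys)
    return np.stack(merged)[order]
```

With K objects on an H×W token grid, the output has at most K rows regardless of resolution, which is where the adaptive compression ratio (e.g. 2.2% of original tokens) would come from under this reading.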