PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

📅 2026-05-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses two problems: the quadratic computational cost that dense visual token sequences impose on large vision-language models at inference time, and the difficulty existing elastic compression methods have in preserving both spatial detail and semantic fidelity under strong compression. The authors propose a visual tokenization architecture that divides the work of feature extraction: spatial pool tokens act as anchors for the low-frequency layout, and elastic query tokens are conditioned on those anchors so that they capture complementary rather than redundant information. Combining pooled anchors with conditioned elastic queries is intended to mitigate the spectral aliasing of pooling-only compression and the loss of spatial grounding of query-only compression. Across 27 benchmarks, the method consistently outperforms existing Matryoshka baselines across visual-token budgets, advances the accuracy-efficiency Pareto frontier, and preserves a "train once, deploy anywhere" workflow in which a single model runs at multiple token budgets.
📝 Abstract
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
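The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of one way pooled anchors and anchor-conditioned elastic queries could fit together. It is not the authors' implementation. Every name here (average_pool_grid, cross_attention, pool_conditioned_tokens, query_bank) is hypothetical, and the specific choices (average pooling, projection-free single-head attention, an additive conditioning step, selecting a prefix of a query bank as the elastic budget) are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (an assumption to keep the sketch short).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def average_pool_grid(features, grid, pooled):
    # Average-pool a row-major (grid*grid, d) patch map to (pooled*pooled, d).
    # Assumes grid is divisible by pooled.
    d = features.shape[-1]
    k = grid // pooled
    x = features.reshape(pooled, k, pooled, k, d)
    return x.mean(axis=(1, 3)).reshape(pooled * pooled, d)


def pool_conditioned_tokens(features, grid, pooled, query_bank, num_queries):
    # 1) Pool tokens: grid-aligned, low-frequency layout anchors.
    anchors = average_pool_grid(features, grid, pooled)
    # 2) Elastic budget: keep only the first num_queries entries of the bank
    #    (assumed nesting scheme).
    queries = query_bank[:num_queries]
    # 3) Condition the queries on the anchors (assumed additive update).
    conditioned = queries + cross_attention(queries, anchors)
    # 4) Resample the full-resolution patch features with the conditioned queries.
    resampled = cross_attention(conditioned, features)
    # Visual tokens passed to the language model: anchors followed by query tokens.
    return np.concatenate([anchors, resampled], axis=0)


rng = np.random.default_rng(0)
grid, pooled, dim = 8, 2, 16
features = rng.standard_normal((grid * grid, dim))   # 64 patch features
query_bank = rng.standard_normal((32, dim))          # stand-in for learned queries

small = pool_conditioned_tokens(features, grid, pooled, query_bank, num_queries=4)
large = pool_conditioned_tokens(features, grid, pooled, query_bank, num_queries=8)
print(small.shape, large.shape)
```

In this sketch the token budget is the number of pool tokens plus the number of queries kept, so a single query bank serves several budgets, and the tokens produced at a smaller budget are a prefix of those produced at a larger one. Whether PARCEL nests its tokens this way is not stated in the abstract.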
Problem

Research questions and friction points this paper is trying to address.

visual-token compression
spectral aliasing
spatial grounding
vision-language models
computational bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

elastic token compression
pool-anchored resampling
conditioned queries
vision-language understanding
spatial grounding