PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the quadratic computational bottleneck that dense visual token sequences impose on large vision-language models at inference time, and the failure of existing elastic compression methods to preserve both spatial detail and semantic fidelity under aggressive compression. The authors propose PARCEL, a visual tokenization architecture that divides the work of feature extraction: spatial pool tokens act as low-frequency layout anchors, and elastic query tokens are conditioned on those anchors so that they capture complementary rather than redundant information. Combining pool-anchored resampling with conditioned elastic queries is intended to mitigate the spectral aliasing of pooling-only compression and the loss of spatial grounding of query-only compression. Across 27 benchmarks, the method improves the accuracy-efficiency Pareto frontier and consistently outperforms existing Matryoshka baselines across visual-token budgets, while keeping a single “train once, deploy anywhere” model that runs at multiple token budgets.
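
To make the quadratic bottleneck concrete, the short sketch below counts attention pairs for a few visual-token budgets. The budgets (576, 144, 36, 9) and the 64 text tokens are illustrative assumptions, not figures from the paper, and the function name is hypothetical.

```python
def attention_pair_count(num_visual_tokens: int, num_text_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over the joint sequence."""
    sequence_length = num_visual_tokens + num_text_tokens
    return sequence_length * sequence_length


# Hypothetical budgets and prompt length, chosen only for illustration.
visual_budgets = [576, 144, 36, 9]
text_tokens = 64

pair_counts = {b: attention_pair_count(b, text_tokens) for b in visual_budgets}

for budget, pairs in pair_counts.items():
    ratio = pair_counts[visual_budgets[0]] / pairs
    print(f"{budget:4d} visual tokens -> {pairs:7d} attention pairs "
          f"({ratio:.1f}x fewer than the largest budget)")
```

Under these assumed numbers, cutting the visual tokens by 4x reduces attention pairs by more than 4x, which is why the visual-token budget dominates inference cost.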

📝 Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
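
The abstract does not spell out how Pool-Conditioned Query Resampling is implemented, so the following NumPy sketch is only one plausible reading under stated assumptions. It assumes that anchors come from average pooling over the patch grid, that elastic queries are a nested prefix of a learned query bank, and that conditioning is a residual cross-attention from queries to anchors. All function names, shapes, and the single-head attention without projections are hypothetical simplifications.

```python
import numpy as np


def avg_pool_grid(features: np.ndarray, grid: int, out: int) -> np.ndarray:
    """Average-pool a (grid*grid, d) patch sequence down to (out*out, d)."""
    d = features.shape[1]
    k = grid // out  # pooling window size; assumes grid is divisible by out
    blocks = features.reshape(out, k, out, k, d)
    return blocks.mean(axis=(1, 3)).reshape(out * out, d)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Single-head cross-attention with no learned projections (assumption)."""
    scores = queries @ memory.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ memory


def pool_conditioned_resample(features, grid, pool_out, query_bank, num_queries):
    # 1) Spatial pool tokens serve as low-frequency layout anchors.
    anchors = avg_pool_grid(features, grid, pool_out)
    # 2) Elastic budget: take a nested prefix of the query bank (assumption).
    queries = query_bank[:num_queries]
    # 3) Condition the queries on the anchors (assumed residual cross-attention).
    conditioned = queries + cross_attend(queries, anchors)
    # 4) Conditioned queries resample the full-resolution patch features.
    detail = cross_attend(conditioned, features)
    # Output: grid-aligned anchors followed by complementary query tokens.
    return np.concatenate([anchors, detail], axis=0)


rng = np.random.default_rng(0)
patch_features = rng.standard_normal((24 * 24, 32))  # hypothetical 24x24 grid
query_bank = rng.standard_normal((64, 32))           # hypothetical learned bank

tokens = pool_conditioned_resample(patch_features, grid=24, pool_out=3,
                                   query_bank=query_bank, num_queries=7)
print(tokens.shape)
```

In this sketch the token budget is set by `pool_out` and `num_queries` at call time, which mirrors the idea of running one trained model at several budgets. Nothing here encourages the queries to be complementary to the anchors; in a real model that behavior would have to come from training.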

Problem

Research questions and friction points this paper is trying to address.

visual-token compression

spectral aliasing

spatial grounding

vision-language models

computational bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

elastic token compression

pool-anchored resampling

conditioned queries