Images are Worth Variable Length of Representations

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision encoders represent images using fixed-length token sequences, ignoring inter-image variations in information content—leading to redundant and inefficient encoding. To address this, we propose DOVE (Dynamic Vision Encoder), the first framework to introduce an image-information-driven, variable-length token generation paradigm. DOVE employs a variational autoencoder architecture to enable differentiable control over token count; it further incorporates query-conditioned tokenization and attention-based gating to support semantic-aware, region-focused encoding. Experiments demonstrate that DOVE achieves comparable or superior reconstruction quality while significantly reducing average token count. In linear probing and multimodal downstream tasks—including visual question answering, captioning, and classification—DOVE consistently outperforms state-of-the-art fixed-length encoders using fewer tokens. These results validate both the effectiveness and generalizability of dynamic, information-adaptive visual representation.
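The core idea above, that visually redundant images deserve fewer tokens than complex ones, can be illustrated with a toy sketch. The greedy redundancy-pruning rule below is an assumption of this sketch only: DOVE instead learns token counts end-to-end through a variational autoencoder with differentiable gating. The function name `adaptive_tokenize`, the similarity threshold, and the synthetic features are all hypothetical.

```python
import numpy as np

def adaptive_tokenize(patch_feats: np.ndarray, sim_threshold: float = 0.9) -> np.ndarray:
    """Toy information-adaptive tokenizer: keep a patch token only if it
    is not too similar (cosine) to any token already kept, so redundant
    images compress to fewer tokens than visually complex ones.
    Illustrative stand-in only -- not DOVE's learned mechanism."""
    normed = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8)
    kept = [0]  # always keep the first token
    for i in range(1, normed.shape[0]):
        sims = normed[i] @ normed[kept].T  # cosine similarity to kept tokens
        if sims.max() < sim_threshold:
            kept.append(i)
    return patch_feats[kept]

rng = np.random.default_rng(0)
# Diverse features stand in for a cluttered scene; near-duplicate rows for a blank wall.
complex_img = rng.normal(size=(64, 32))
simple_img = np.tile(rng.normal(size=(1, 32)), (64, 1))

n_complex = adaptive_tokenize(complex_img).shape[0]  # keeps most tokens
n_simple = adaptive_tokenize(simple_img).shape[0]    # collapses to very few
```

The variable output length is the point: token count tracks how much non-redundant content the "image" carries, which is the behavior DOVE learns rather than hard-codes.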

📝 Abstract
Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.
Problem

Research questions and friction points this paper is trying to address.

Fixed-length token sequences ignore that images carry varying amounts of information, producing redundant, inefficient encodings
How to reduce the average token count without degrading reconstruction quality
How to focus encoding on query-relevant regions for targeted semantic extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic vision encoder that emits a variable number of tokens per image
Query-conditioned tokenization for targeted semantic extraction
Maintains high reconstruction quality with far fewer tokens on average
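The query-conditioned idea can likewise be sketched with a stand-in rule: attend a query vector over patch tokens, then keep only the smallest token set covering most of the attention mass. This nucleus-style heuristic, the function `query_select`, and the `mass` parameter are assumptions of this sketch, not the paper's gating mechanism.

```python
import numpy as np

def query_select(patch_feats: np.ndarray, query: np.ndarray,
                 mass: float = 0.9) -> np.ndarray:
    """Toy query-conditioned tokenizer: softmax-attend the query over
    patch tokens and keep the smallest set covering `mass` of the
    attention distribution. Sharply query-relevant content therefore
    needs fewer tokens. Illustrative only -- not DOVE's gating."""
    scores = patch_feats @ query / np.sqrt(patch_feats.shape[-1])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # tokens by descending relevance
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, mass)) + 1   # smallest prefix covering `mass`
    return patch_feats[order[:k]]

rng = np.random.default_rng(0)
patches = rng.normal(size=(64, 32))
query = rng.normal(size=(32,))
patches[0] = 3.0 * query  # one patch strongly matches the query

k_focused = query_select(patches, query).shape[0]           # few tokens suffice
k_uniform = query_select(np.ones((64, 32)), query).shape[0] # many tokens needed
```

When one region dominates the query's attention, the sketch keeps a handful of tokens; when no region stands out, the attention mass spreads and many tokens are retained, mirroring the region-focused, variable-length behavior described above.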