Visual Text Compression as Measure Transport

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Token savings from visual text compression do not reliably predict downstream task performance, and existing work lacks a principled measure of task-relevant information loss. This work addresses both gaps by formulating the problem as measure transport: the push-forward map induced by a ViT patch encoder has a transport cost that decomposes into a precision cost and a coverage cost. Based on this decomposition, the authors propose a label-free routing criterion and a transport-informed foveation mechanism that re-encodes high-cost regions at higher resolution. Across 24 NLP datasets, the label-free rule matches the per-dataset oracle on 17 datasets, improves the average task score by 3.3%, and uses 10.3% fewer tokens on average relative to a pure-LLM baseline.

📝 Abstract

Visual text compression (VTC) promises efficient long-context processing by rendering text into an image and re-encoding it with a vision-language model, often producing $3$--$20\times$ fewer decoder tokens than subword tokenization. Yet token savings do not translate predictably into downstream utility: on some tasks the visual path matches or exceeds the text path, on others it collapses, and the compression ratio itself does not predict which regime will occur. The missing quantity is therefore not another summary of efficiency, but a principled measure of task-relevant information loss induced by visual encoding. We address this problem by formulating VTC in the language of measure transport. Treating text and visual tokens as empirical probability measures, we show that the ViT patch encoder induces a push-forward map whose transport cost decomposes into a precision cost from within-patch aggregation and a coverage cost from cross-patch fragmentation. Both terms are estimable from downstream-label-free probes. This formulation yields two operational consequences: a downstream-label-free routing criterion that selects whether to use the visual path for a given input or benchmark instance, and a transport-informed foveation mechanism that re-encodes high-cost regions at higher resolution. Across $24$ NLP datasets at Qwen3-4B, our label-free rule matches the per-dataset oracle on $17/24$ datasets ($70.8\%$), and improves the average task score by $+3.3\%$ with $-10.3\%$ average tokens relative to a pure-LLM baseline.
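
The push-forward idea in the abstract can be made concrete with a toy sketch. This is not the paper's estimator: it assumes token embeddings are points with uniform mass, models the patch encoder as averaging the tokens assigned to each patch, and uses the quadratic cost of that map as a stand-in for the within-patch precision cost. The names `pushforward`, `precision_cost`, and `route`, and the threshold value, are hypothetical. The coverage cost is not modeled, because the abstract does not specify how cross-patch fragmentation is measured.

```python
import numpy as np

def pushforward(tokens, patch_ids):
    """Push the uniform empirical measure on `tokens` through a toy patch map.

    Assumption: each token is sent to the mean of its patch, so the image
    measure puts mass (tokens in patch) / (total tokens) on each centroid.
    """
    tokens = np.asarray(tokens, dtype=float)
    patch_ids = np.asarray(patch_ids)
    patches = np.unique(patch_ids)  # sorted unique patch labels
    centroids = np.stack([tokens[patch_ids == p].mean(axis=0) for p in patches])
    weights = np.array([(patch_ids == p).sum() for p in patches]) / len(tokens)
    return patches, centroids, weights

def precision_cost(tokens, patch_ids):
    """Quadratic cost of the toy map: mean squared token-to-centroid distance."""
    tokens = np.asarray(tokens, dtype=float)
    patch_ids = np.asarray(patch_ids)
    patches, centroids, _ = pushforward(tokens, patch_ids)
    index = np.searchsorted(patches, patch_ids)  # row of each token's centroid
    return float(((tokens - centroids[index]) ** 2).sum(axis=1).mean())

def route(cost, threshold):
    """Hypothetical label-free rule: take the visual path only if cost is low."""
    return "visual" if cost <= threshold else "text"

# Four 2-D "token embeddings" grouped into two patches.
tokens = [[0, 0], [2, 0], [10, 0], [10, 4]]
patch_ids = [0, 0, 1, 1]
cost = precision_cost(tokens, patch_ids)
print(cost)              # 2.5
print(route(cost, 3.0))  # visual
```

In this toy setting the cost needs no downstream labels, only the inputs and the patch assignment, which is the property the routing criterion relies on. A region with a high cost would be the natural candidate for re-encoding at higher resolution.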

Problem

Research questions and friction points this paper is trying to address.

visual text compression

information loss

measure transport

vision-language models

token efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual text compression

measure transport

push-forward map