UIPress: Bringing Optical Token Compression to UI-to-Code Generation

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work targets the high prefill latency of UI-to-Code generation, which stems from the large number of visual tokens a screenshot produces and their uneven information density; existing compression methods handle neither well. The authors present UIPress, which they describe as the first encoder-side learned (optical) compression method for this task: a lightweight module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. The module combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress approximately 6,700 visual tokens into a fixed budget of 256. To bridge the resulting representation gap, the decoder is adapted with LoRA. On Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, a +7.5% gain over the uncompressed baseline and +4.6% over the strongest inference-time method, with a 9.1× time-to-first-token speedup.
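To put the token budget in perspective, the short sketch below works through the arithmetic implied by compressing roughly 6,700 visual tokens to 256. It is only a back-of-the-envelope calculation: the function name `compression_stats` is hypothetical, and the quadratic figure counts self-attention token pairs over the visual tokens alone, ignoring text tokens and every other cost. The reported 9.1× speedup is an end-to-end measurement and need not match either ratio.

```python
def compression_stats(n_in: int, n_out: int) -> dict:
    """Token-count arithmetic only; this does not model measured latency."""
    ratio = n_in / n_out
    return {
        # How many input tokens map onto each output token on average.
        "token_ratio": ratio,
        # Self-attention compares every token with every other token,
        # so the number of pairs shrinks with the square of the ratio.
        "attention_pair_ratio": ratio ** 2,
        # Share of the visual sequence that no longer reaches the decoder.
        "tokens_removed_fraction": 1 - n_out / n_in,
    }


# Approximate figures quoted in the summary: ~6,700 tokens in, 256 out.
stats = compression_stats(6700, 256)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the visual sequence shrinks by a factor of about 26, and about 96% of the visual tokens never enter the decoder's prefill.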

📝 Abstract
UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence; neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ~6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ~21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering a 9.1× time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.
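The abstract calls the module lightweight, and part of that comes from the choice of depthwise-separable convolutions. The sketch below compares parameter counts for a standard and a depthwise-separable convolution, then checks what total model size the quoted 21.7M and 0.26% figures imply. The channel width `C = 1024` and kernel size `K = 3` are hypothetical placeholders, since the abstract does not give the module's actual dimensions, and the function names are illustrative only.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of an ordinary k x k convolution."""
    return k * k * c_in * c_out + c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """One k x k filter per input channel, followed by a 1 x 1 channel mix."""
    depthwise = k * k * c_in + c_in      # per-channel spatial filter + bias
    pointwise = c_in * c_out + c_out     # 1 x 1 convolution + bias
    return depthwise + pointwise


# Hypothetical dimensions, chosen for illustration only.
C, K = 1024, 3
std = standard_conv_params(C, C, K)
sep = depthwise_separable_params(C, C, K)
print(f"standard: {std:,}  separable: {sep:,}  saving: {std / sep:.1f}x")

# Figures quoted in the abstract: ~21.7M trainable parameters, 0.26% of the base.
TRAINABLE = 21.7e6
FRACTION = 0.0026
implied_total = TRAINABLE / FRACTION
print(f"implied base-model size: {implied_total / 1e9:.2f}B parameters")
```

With these placeholder dimensions the separable layer needs roughly a ninth of the parameters of a standard one. The implied base size comes out slightly above 8 billion, which is plausible for a model marketed as 8B, although the abstract does not state the exact count.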
Problem

Research questions and friction points this paper is trying to address.

UI-to-Code generation
visual token compression
prefill latency
information density
optical compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

optical token compression
UI-to-Code generation
learned compression
vision-language models
prefill latency reduction